diff --git a/Algorithm/CRF/CRF.md b/Algorithm/CRF/CRF.md index b9a38b0..447c3a7 100644 --- a/Algorithm/CRF/CRF.md +++ b/Algorithm/CRF/CRF.md @@ -1,7 +1,45 @@ -# Condiitonal Random Field +# Conditonal Random Field -## Probabilistic Undirected Graphical Model (aka. Markov Random Field) +> Can be considered to a extension of [MEM](../MEM/MEM.md) + +## Overview + +### Quick View + +| Category | Usage | Methematics | Application Field | +| ------------------- | -------------- | ----------- | ----------------- | +| Supervised Learning | Classification | Entropy | NLP | + +## Background - From MEM to CRF + +![](https://i.stack.imgur.com/khcnl.png) + +### Conditional Maximum Entropy Distribution + +## Concept + +### [Undirected Graph Model](../../Notes/GraphicalModel.md#Undirected-Graph-Model) + +## Viterbi Algorithm ## Links -[Wiki - Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field) +* [**An Introduction to Conditional Random Fields**](https://www.research.ed.ac.uk/portal/files/10482724/crftut_fnt.pdf) + +### Wikipedia + +* [Graphical model](https://en.wikipedia.org/wiki/Graphical_model) +* [Clique (graph theory)](https://en.wikipedia.org/wiki/Clique_(graph_theory)) +* [Markov random field](https://en.wikipedia.org/wiki/Markov_random_field) +* [Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field) + +### Tools + +* [kmkurn/pytorch-crf: (Linear-chain) Conditional random field in PyTorch.](https://github.com/kmkurn/pytorch-crf) + * [pytorch-crf — pytorch-crf 0.7.2 documentation](https://pytorch-crf.readthedocs.io/en/stable/) +* [CRF++](https://taku910.github.io/crfpp/) + * [github](https://github.com/taku910/crfpp) +* [TensorFlow CRF](https://www.tensorflow.org/api_docs/python/tf/contrib/crf) + * [github](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf) +* [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/) + * [github](https://github.com/TeamHG-Memex/sklearn-crfsuite/) diff --git a/Algorithm/EM/EM_Iris/EM_Iris_FromScratch.py b/Algorithm/EM/EM_Iris/EM_Iris_FromScratch.py new file mode 100644 index 0000000..e69de29 diff --git a/Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py b/Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py new file mode 100644 index 0000000..9fca1b5 --- /dev/null +++ b/Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py @@ -0,0 +1,2 @@ +from hmmlearn import hmm + diff --git a/Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py b/Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py new file mode 100644 index 0000000..d89f2dc --- /dev/null +++ b/Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py @@ -0,0 +1,8 @@ +import numpy as np + +def log_normalize(vector): + return np.log(vector) - np.log(np.sum(vector)) + +def log_sum(vector): + pass + \ No newline at end of file diff --git a/Algorithm/LogisticRegression/LogisticRegression.md b/Algorithm/LogisticRegression/LogisticRegression.md index 1093b2e..d58f30a 100644 --- a/Algorithm/LogisticRegression/LogisticRegression.md +++ b/Algorithm/LogisticRegression/LogisticRegression.md @@ -38,7 +38,7 @@ For each piece of data in the dataset: ## Multiple Classes -### [Multinomial](../MEM/MEM.md) - Softmax Regression (SMR) +### Multinomial - Softmax Regression (SMR) > Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive) 
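As a minimal illustration of the idea above, here is a hedged NumPy sketch of softmax-regression prediction; the weight matrix `W` and bias `b` are hypothetical untrained values used only to show the shapes involved, not anything taken from this repository.

```python
# Minimal sketch of softmax regression prediction (NumPy only).
# W and b are made-up, untrained parameters for illustration.
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 samples, 4 features
W = rng.normal(size=(4, 3))   # 3 mutually exclusive classes
b = np.zeros(3)

probs = softmax(X @ W + b)    # class probabilities, each row sums to 1
y_hat = probs.argmax(axis=1)  # predicted class per sample
```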
@@ -57,6 +57,10 @@ $$ ### Book +Dive into Deep Learning + +* [Ch3.4. Softmax Regression](http://d2l.ai/chapter_linear-networks/softmax-regression.html) + Machine Learning in Action * Ch5 Logistic Regression @@ -93,3 +97,5 @@ Multinomial (softmax) * [2 Ways to Implement Multinomial Logistic Regression in Python](http://dataaspirant.com/2017/05/15/implement-multinomial-logistic-regression-python/) - use scikit learn * [Machine Learning and Data Science: Multinomial (Multiclass) Logistic Regression](https://www.pugetsystems.com/labs/hpc/Machine-Learning-and-Data-Science-Multinomial-Multiclass-Logistic-Regression-1007/) +* [mlxtend - Softmax Regression](https://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/) + * [jupyter notebook](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/softmax-regression.ipynb) diff --git a/Algorithm/MEM/MEM.md b/Algorithm/MEM/MEM.md new file mode 100644 index 0000000..023fc41 --- /dev/null +++ b/Algorithm/MEM/MEM.md @@ -0,0 +1,85 @@ +# Maximum Entropy Model + +Maximum Entropy Classifier / [Multinomial Logistic Regression - i.e. Softmax](../LogisticRegression/LogisticRegression.md#Multinomial---Softmax-Regression-(SMR)), + +> Can be considered to a mother of other algorithms +> +> [Condiitonal Random Field](../CRF/CRF.md) + +## Brief Description + +### Quick View + +Category|Usage|Methematics|Application Field +--------|-----|-----------|----------------- +Supervised Learning|Classification|Entropy|Many + +## Concept + +### The MEM Model + +#### Background + +Consider a machine learning problem + +* $x$ = $(x_1, x_2, \dots, s_m)$ is input feature vector +* $y \in \{1, 2, \dots, k\}$ => a k classes classification problem + +Given k linear model for machine learning. Each has dimension of m. + +$$ +\phi = w_{i1}x_1 + w_{i2}x_2+\cdots + w_{im}x_m,~~~1\leq i \leq k +$$ + +Prediction "class" $\hat{y}$ is the maximum "score" for each linear model output. + +$$ +\hat{y} = \arg\max_{1\leq i \leq k} \phi_i(x) +$$ + +TBD + + + + + +### Training the Model + +* GIS Algorithm +* IIS Algorithm +* Gradient Descent +* [Quasi-Newton Method](https://en.wikipedia.org/wiki/Quasi-Newton_method) (擬牛頓法) - L-BFGS Algorithm + +#### GIS Algorithm + +> GIS stands for Generalized Iterative Scaling + +#### IIS Algorithm + +> IIS stands for Improved Iterative Scaling. Improved from [GIS](#GIS-Algorithm) + +### Solving Overfitting + +* Feature Select: throw out rare feature +* Feature Induction: pick useful feature (improves performance) +* Smoothing + +### Feature Selection + +### Feature Induction + +### Smoothing + +## Application + +> MEM is a classification model. It's not impossible to solve the sequential labeling problem, just not so suitable. +> For example of POS tagging, a classifier maybe not considered the global meaning information. 
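Since a maximum entropy classifier coincides with multinomial logistic regression, one hedged way to experiment with it is scikit-learn's `LogisticRegression` trained with the L-BFGS solver mentioned above; the iris data here is only a stand-in classification dataset, not a POS-tagging corpus.

```python
# Hedged sketch: a maximum entropy classifier fit as multinomial logistic
# regression with the L-BFGS solver; iris is just a placeholder dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=200)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # conditional distribution p(y | x) per sample
```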
+ +### POS Tagging + +## Resources + +### Wikipedia + +* [Principle of maximum entropy - Maximum entropy models](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#Maximum_entropy_models) +* [Multinomial logistic regression (Maximum entropy classifier)](https://en.wikipedia.org/wiki/Maximum_entropy_classifier) diff --git a/Algorithm/NaiveBayes/NaiveBayes.md b/Algorithm/NaiveBayes/NaiveBayes.md index e3397b5..cd6d232 100644 --- a/Algorithm/NaiveBayes/NaiveBayes.md +++ b/Algorithm/NaiveBayes/NaiveBayes.md @@ -2,7 +2,9 @@ ## Brief Description -Naive bayes are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. +Naive bayes are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of *conditional independence* between every pair of features given the value of the class variable. + +> But in the real word, in our concept, most of the things are not conditional independence. e.g. context in NLP ### Quick View @@ -21,6 +23,10 @@ Supervised Learning|Classification|Bayes' Theorem| ## Concept +### Bayes Decision + +Posterior prob. = (Likelihood * Prior prob.) / Evidence + ### Bayes' Theorem $$ @@ -35,10 +41,30 @@ $$ ### Real-world conditions -* We predict label by multiplying them. But if any of these probability is 0, then we will get 0 when we multiply them. To lessen the impact of this, we'll initialize all of our occurence counts to 1, and initialize the denominators to 2. (for binary classifier) -* Another problem is **Underflow**: doing too many multiplications of small numbers. (In programming, multiply many small numbers will eventually rounds off to 0) +* We predict label by multiplying them. But if any of these [probability is 0](#Zero-Probability-=>-Smoothing), then we will get 0 when we multiply them. To lessen the impact of this, we'll initialize all of our occurence counts to 1, and initialize the denominators to 2. (for binary classifier) +* Another problem is **Underflow**: doing too many multiplications of small numbers. (In programming, multiply many small numbers will eventually rounds off to 0 which called **floating-point underflow**) * Solution 1: Take the natural logarithm of this product +#### Zero Probability => Smoothing + +> original: $P(w_k|c_j) = \displaystyle\frac{n_k}{n}$ + +m-estimation: $P(w_k|c_j) = \displaystyle\frac{n_k + mp}{n + m}$ + +> additional m "virtual samples" distributed according to p + +## Application + +### Document Classification/Categorization + +Smoothing using Laplace smoothing (for $mp = 1$ and $m$ = Vocabulary) + +$$ +P(w_k|c_j) = \frac{n_k + 1}{n + |\operatorname{Vocabulary}|} +$$ + +### Word Sense Disambiguation + ## TODO * Figure out why the log mode in predictOne function has lower accuracy when using + than using * as the origin mode. 
([Line 66](NaiveBayes_Nursery/NaiveBayes_Nursery_sklearn.py)) @@ -55,5 +81,6 @@ $$ ## Wikipedia +* [Additive smoothing (Laplace smoothing)](https://en.wikipedia.org/wiki/Additive_smoothing) * [Bayesian Machine Learning](http://fastml.com/bayesian-machine-learning/) -* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) \ No newline at end of file +* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) diff --git a/Algorithm/PCA/PCA.md b/Algorithm/PCA/PCA.md index edc19a9..e22563f 100644 --- a/Algorithm/PCA/PCA.md +++ b/Algorithm/PCA/PCA.md @@ -10,9 +10,9 @@ A method for doing dimensionality reduction by transforming the feature space to ### Quick View -Category|Usage|Methematics|Application Field ---------|-----|-----------|----------------- -Unsupervised Learning|Dimensionality Reduction|Orthogonal, Covariance Matrix, Eigenvalue Analysis| +| Category | Usage | Methematics | Application Field | +| --------------------- | ------------------------ | ------------------------------------------------------------------------ | ----------------- | +| Unsupervised Learning | Dimensionality Reduction | Orthogonal, Covariance Matrix, Eigenvalue Analysis, Lagrange Multipliers | ## Concepts @@ -22,13 +22,13 @@ Steps * Take the first principal component to be in the direction of the largest variability of the data * The second preincipal component will be in the direction orthogonal to the first principal component -> (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix) + > (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix) * Once we have the eigenvectors of the covariance matrix, we can take the top N eigenvectors => N most important feature * Multiply the data by the top N eigenvectors to transform our data into the new space Pseudocode -``` +```txt Remove the mean Compute the covariance matrix Find the eigenvalues and eigenvectors of the covariance matrix @@ -42,11 +42,11 @@ Transform the data into the new space created by the top N eigenvectors Variables * m x n matrix: $X$ - * In practice, column vectors of $X$ are positively correlated - * the hypothetical factors that account for the score should be uncorrelated + * In practice, column vectors of $X$ are positively correlated + * the hypothetical factors that account for the score should be uncorrelated * orthogonal vectors: $\vec{y}_1, \vec{y}_2, \dots, \vec{y}_r$ - * We require that the vectors span $R(X)$ - * and hence the number of vectors, $r$, should be euqal to the rank of $X$ + * We require that the vectors span $R(X)$ + * and hence the number of vectors, $r$, should be euqal to the rank of $X$ The covariance matrix is $$ @@ -88,11 +88,11 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal. ## Reference * Linear Algebra with Applications - * Ch 5 Orthogonality - * Ch 6 Eigenvalues - * Ch 6.5 Application 4 - PCA - * Ch 7.5 Orthogonal Transformations - * Ch 7.6 The Eigenvalue Problem + * Ch 5 Orthogonality + * Ch 6 Eigenvalues + * Ch 6.5 Application 4 - PCA + * Ch 7.5 Orthogonal Transformations + * Ch 7.6 The Eigenvalue Problem ## Links @@ -101,7 +101,7 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal. 
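Before the links, here is a from-scratch sketch of the PCA pseudocode in the Concepts section above (NumPy only); `top_n` and the random data are illustrative choices, not part of the original notes.

```python
# From-scratch sketch of the PCA pseudocode above (NumPy only).
import numpy as np

def pca(data, top_n=2):
    mean = data.mean(axis=0)
    centered = data - mean                      # remove the mean
    cov = np.cov(centered, rowvar=False)        # covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)    # eigen-decomposition (symmetric matrix)
    order = np.argsort(eig_vals)[::-1][:top_n]  # sort, keep top N eigenvectors
    components = eig_vecs[:, order]
    return centered @ components, components    # data transformed into the new space

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced, components = pca(X, top_n=2)
print(X_reduced.shape)  # (100, 2)
```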
### Tutorial * [**Siraj Raval - Dimensionality Reduction**](https://www.youtube.com/watch?v=jPmV3j1dAv4) - * [Github](https://github.com/llSourcell/Dimensionality_Reduction) + * [Github](https://github.com/llSourcell/Dimensionality_Reduction) ### Scikit Learn diff --git a/Algorithm/SVM/SVM.md b/Algorithm/SVM/SVM.md index 5a879c1..9cfe542 100644 --- a/Algorithm/SVM/SVM.md +++ b/Algorithm/SVM/SVM.md @@ -10,9 +10,9 @@ Support Vector Machines (SVM) are learning systems that use a hypothesis space o ### Quick View -Category|Usage|Methematics|Application Field ---------|-----|-----------|----------------- -Supervised Learning|Classification (Main), Regression, Outliers Detection Clustering (Unsupervised)|Convex Optimization, Constrained Optimization, Lagrange Multipliers|Numerous +| Category | Usage | Methematics | Application Field | +| ------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------- | ----------------- | +| Supervised Learning | Classification (Main), Regression, Outliers Detection Clustering (Unsupervised) | Convex Optimization, Constrained Optimization, Lagrange Multipliers | Numerous | * Support Vector Machine is suited for extreme cases (little sample set) * SVM find a hyper-plane that separates its training data in such a way that the distance between the hyper plane and the cloest points form each class is maximized @@ -39,12 +39,12 @@ Disadvantage * Poor performance when features >> samples * SVMs do not provide probability estimates -SVM vs. Perceptron|SVM|Perceptron / NN -------------------|---|---------- -**Solving Problem**|Optimization|Iteration -**Optimal**|Global (∵ convex)|Local -**Non-linear Seprable**|Higher dimension|Stack multi-layer model -**Performance**|Better with prior knowledge|Skip feature engineering step +| SVM vs. Perceptron | SVM | Perceptron / NN | +| ----------------------- | --------------------------- | ----------------------------- | +| **Solving Problem** | Optimization | Iteration | +| **Optimal** | Global (∵ convex) | Local | +| **Non-linear Seprable** | Higher dimension | Stack multi-layer model | +| **Performance** | Better with prior knowledge | Skip feature engineering step | ## Terminology @@ -110,22 +110,30 @@ SVM vs. Perceptron|SVM|Perceptron / NN ### Kernal Function * Use a kernal trick to reduce the computational cost - * Kernel Function: Transform a non-linear space into a linear space - * Popular kernel types - * Linear Kernel + * Kernel Function: Transform a non-linear space into a linear space + * Popular kernel types + * Linear Kernel - $K(x, y) = x \times y$ + $\kappa(x_i, x_j) = x_i^Ty_j$ - * Polynomial Kernel + > $K(x, y) = x \times y$ + + * Polynomial Kernel + + $\kappa(x_i, x_j) = (x_i^Ty_j)^d$ + + When $d = 0$ then it is linear kernel + + > $K(x, y) = (x \times y + 1)^d$, $d \geq 0$ - $K(x, y) = (x \times y + 1)^d$ + * Radial Basis Function (RBF) Kernel - * Radial Basis Function (RBF) Kernel + $\kappa(x_i, x_j) = \exp(-\frac{||x_i - x_j||^2}{2\sigma^2})$, $\sigma > 0$ and is width of RBF kernel - $K(x, y) = e^{-\gamma ||x-y||^2}$ + > $K(x, y) = e^{-\gamma ||x-y||^2}$ - * Sigmoid Kernel - * ... + * Sigmoid Kernel + * ... 
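A hedged sketch of the kernel formulas listed above as plain NumPy functions; the `gamma` and `d` values are illustrative defaults, not tuned settings.

```python
# Sketch of the kernel functions above (NumPy only); parameters are illustrative.
import numpy as np

def linear_kernel(x, y):
    return x @ y                                   # K(x, y) = x . y

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1) ** d                        # K(x, y) = (x . y + 1)^d, d >= 0

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # K(x, y) = exp(-gamma ||x - y||^2)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```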
### Tune Parameter diff --git a/Algorithm/SVM/SVMDerivation.md b/Algorithm/SVM/SVMDerivation.md index 3c14f90..0b08640 100644 --- a/Algorithm/SVM/SVMDerivation.md +++ b/Algorithm/SVM/SVMDerivation.md @@ -16,7 +16,7 @@ ### Theorem * Lagrange Duality - * Karush-Kuhn-Tucker (KKT) + * Karush-Kuhn-Tucker (KKT) ### Big Picture @@ -51,6 +51,8 @@ $$|\vec{w} \cdot \vec{x} + b|$$ We can represent our correctness of classification by: $$y_i (\vec{w} \cdot \vec{x} + b)$$ +> $$\begin{cases} \vec{w}^T\vec{x} + b \geq +1, & y_i = +1 \\ \vec{w}^T\vec{x} + b \leq -1, & y_i = -1\end{cases}$$ + * If classify correctly => $y_i$ and $(\vec{w} \cdot \vec{x} + b)$ will have same sign => Positive product * Else => Negative product @@ -72,7 +74,7 @@ $$ \gamma_i = y_i \bigg(\frac{\vec{w}}{||\vec{w}||} \cdot \vec{x_i} + \frac{b}{| $$ r = \frac{\hat{r}}{||\vec{w}||} $$ -when ||w|| = 1 <=> functional margin = geometric margin +when $||\vec{w}||$ = 1 <=> functional margin = geometric margin ### Maximize Margin @@ -146,10 +148,12 @@ Distance between two hyperplane are called **margin**. Margin depends on normal Apply *Lagrange Duality*, get the optimal solution of primal problem by solving dual problem Advantage + 1. Dual problem is easier to solve + * remove the constrain 2. Introduce kernel function, expend to non-linear classification problem -For each constraint introduce a **Lagrange multiplier** $\alpha_i \geq 0$ +For each constraint introduce a [**Lagrange multiplier**](https://en.wikipedia.org/wiki/Lagrange_multiplier) $\alpha_i \geq 0$ Lagrange function $$ @@ -220,9 +224,9 @@ It takes the large optimization problem and breaks it into many small problem. 1. Once we have a set of alphas we can easily compute our weights w 2. And get the separating hyperplane * SMO algorithm choose two alphas to optimize on each cycle. Once a sutiable pair of alphas is found, one is increased and one is decreased. - * Suitable criteria - * A pair mus meet is that both of the alphas have to be outside their margin boundary - * The alphas aren't already clamped or bounded + * Suitable criteria + * A pair mus meet is that both of the alphas have to be outside their margin boundary + * The alphas aren't already clamped or bounded * The reason that we have to change two alphas at the same time is because we need have a constraint $\displaystyle \sum \alpha_i * y^{(i)} = 0$ ## Introduce slack variable (C parameter) @@ -232,4 +236,5 @@ It takes the large optimization problem and breaks it into many small problem. 
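As an illustrative sketch of how the slack penalty C trades margin width against training errors, the snippet below uses scikit-learn's `SVC`; the blob dataset and the particular C values are arbitrary choices for demonstration, not a recommendation.

```python
# Hedged sketch: effect of the slack penalty C on a soft-margin SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide margin, more slack allowed (more support vectors);
    # large C -> narrow margin, fewer violations tolerated.
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}")
```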
## Reference * 李航 - 統計機器學習 -* Machine Learning in Action \ No newline at end of file +* Machine Learning in Action +* 機器學習 diff --git a/Algorithm/VSM/VSM_Document_Similarity/VSM_Document_Similarity_FromScratch.py b/Algorithm/VSM/VSM_Document_Similarity/VSM_Document_Similarity_FromScratch.py new file mode 100644 index 0000000..9602589 --- /dev/null +++ b/Algorithm/VSM/VSM_Document_Similarity/VSM_Document_Similarity_FromScratch.py @@ -0,0 +1,89 @@ +## VSM Document Similarity From Scratch Version +# +# Author: David Lee +# Create Date: 2018/12/2 +# +# Article amount: 3443 + +import numpy as np +import pandas as pd +from collections import defaultdict + +class Similarity: + def euclidianDistanceSimilarity(self, A, B): + return 1.0/(1.0 + np.linalg.norm(A - B)) + def cosineSimilarity(self, A, B): + num = float(A.T*B) + denom = np.linalg.norm(A) * np.linalg.norm(B) + return 0.5 + 0.5 * (num/denom) + +class VSM_Model: + def __init__(self): + pass + + def __toDictionary(self, articleMatrix): + """ + For each article, calculate a dictionary + i.e. word id pair + """ + for article in articleMatrix: + pass + + def __DictToCorpus(self, dictionary): + pass + + def calcTfIdf(self): + pass + + def getSimilarityMat(self, articleMatrix): + pass + + +def documentTokenize(document): + frequencyOfWords = defaultdict(int) + for line in document: + tokens = line.strip().split() + if tokens: # skip empty lines + # Extract single word + for token in tokens[1:]: + word = token[:token.index('/')] + pos = token[token.index('/')+1:] # part-of-speech + # remove common words and meaningless words (using a stoplist) + if pos not in ('w', 'y', 'u', 'c', 'p', 'k', 'm'): + frequencyOfWords[word] += 1 + return frequencyOfWords + +def articlesToMatrix(document, freqOfWords): + articleMatrix = [] + emptyline = 0 # To seperate each article + wordsWeWant = [] + articleCounter = 0 + for line in document: + tokens = line.strip().split() + if tokens: + emptyline = 0 + for token in tokens: + word = token[:token.index('/')] + # remove the words that only appear once in the corpus + if freqOfWords[word] > 1: + wordsWeWant.append(word) + else: + emptyline += 1 + # Found an article + if emptyline == 3: + articleMatrix.append(wordsWeWant) + wordsWeWant = [] + articleCounter += 1 + if wordsWeWant: + articleMatrix.append(wordsWeWant) + articleCounter += 1 + print('Total articles:', articleCounter) + return articleMatrix + +def documentPreprocessing(path): + with open(path, 'r') as chinaNews: + document = chinaNews.readlines() + # 1. tokenize the documents + frequencyOfWords = documentTokenize(document) + # 2. calculate remain word for each article + return articlesToMatrix(document, frequencyOfWords) \ No newline at end of file diff --git a/Notes/FeatureEngineering.md b/Notes/FeatureEngineering.md index c895921..b9b1650 100644 --- a/Notes/FeatureEngineering.md +++ b/Notes/FeatureEngineering.md @@ -380,6 +380,9 @@ The features themselves will work for either model. However, numerical inputs to ## Resources +* [**分分鐘帶你殺入Kaggle Top 1% - 知乎**](https://zhuanlan.zhihu.com/p/27424282) +* [如何ensemble多個神經網路? 
- 知乎](https://www.zhihu.com/question/60753512/answer/184409655) +* [**Kaggle Ensembling Guide | MLWave**](https://mlwave.com/kaggle-ensembling-guide/) * [Wiki - Feature engineering](https://en.wikipedia.org/wiki/Feature_engineering) * [機器學習筆記 - 特徵工程](https://feisky.xyz/machine-learning/basic/feature-engineering.html) * [**Understanding Feature Engineering (Part 1) — Continuous Numeric Data**](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b) diff --git a/Notes/GraphicalModel.md b/Notes/GraphicalModel.md new file mode 100644 index 0000000..b67844a --- /dev/null +++ b/Notes/GraphicalModel.md @@ -0,0 +1,38 @@ +# (Probabilistic) Graphical Model + +## Overview + +* Graph Models represent families of probability distribution via graphs + * [Directed GM](#directed-graph-model): [Bayesian Network](#dag-bayesian-network) + * Consider local information + * [Undirected GM](#undirected-graph-model): [Markov Random Field](#probabilistic-undirected-graph-model-aka-markov-random-field) + * Consider global information + * Combination GM: Chain graphs + +* Typical Graphical Model + * [Generative Model](MachineLearningBigPicture.md#Generative-Model): from Category to Data + * e.g. Naive Bayes + * [Discriminative Model](MachineLearningBigPicture.md#Discriminative-Model): from Data to Category + +## Directed Graph Model + +> [HMM](../HMM/HMM.md) is a Directed Graph Model + +### DAG (Bayesian Network) + +## Undirected Graph Model + +> [CRF is a Undirected Graph Model](#Probabilistic-Undirected-Graph-Model-(aka.-Markov-Random-Field)) + +Undirected Graph Model Concept: + +* Clique (團): the fully connected subgraph (subset of nodes where each pair connected) +* Maximal clique (最大團): a clique that cannot be extended by including one more adjacent vertex (no extra node can be added and remain a clique) + +Score Problem + +* Potential Function (勢函數): map clique to a positive real number (score) + +### Probabilistic Undirected Graph Model (aka. 
Markov Random Field) + +> CRF = Markov Random Field + conditions; C => condition distribution, RF => Markov Random Field (joint distribution) diff --git a/Notes/MachineLearningConcepts.md b/Notes/MachineLearningConcepts.md index b9d5ddb..30f70f5 100644 --- a/Notes/MachineLearningConcepts.md +++ b/Notes/MachineLearningConcepts.md @@ -24,6 +24,7 @@ Table of content - [Binary to Multi-class](#binary-to-multi-class) - [One-vs-rest (one-vs-all) Approaches](#one-vs-rest-one-vs-all-approaches) - [Pairwise (one-vs-one, all-vs-all) Approaches](#pairwise-one-vs-one-all-vs-all-approaches) + - [Multi-Labeled Classification](#multi-labeled-classification) - [Model Validation](#model-validation) - [Splitting Data](#splitting-data) - [Simplest Split](#simplest-split) @@ -38,12 +39,14 @@ Table of content - [Classification](#classification) - [Accuracy (Error Rate)](#accuracy-error-rate) - [Confusion Matrix](#confusion-matrix) - - [Precision, Recall Ratio](#precision-recall-ratio) + - [Precision, Recall Rate](#precision-recall-rate) + - [Precision-Recall Curve (P-R Curve)](#precision-recall-curve-p-r-curve) - [ROC curve](#roc-curve) - [Regression](#regression) - [Mean Absolute Error (MAE)](#mean-absolute-error-mae) - [Mean Squared Error (MSE)](#mean-squared-error-mse) - [Root Mean Squared Error (RMSE)](#root-mean-squared-error-rmse) + - [Mean Absolute Percent Error (MAPE)](#mean-absolute-percent-error-mape) - [Clustering](#clustering) - [Within Groups Sum of Squares](#within-groups-sum-of-squares) - [Mean Silhouette Coefficient of all samples](#mean-silhouette-coefficient-of-all-samples) @@ -70,6 +73,8 @@ Table of content - [Incremental Learning (Online Learning)](#incremental-learning-online-learning) - [Competitive Learning](#competitive-learning) - [Multi-label Classification](#multi-label-classification) + - [Other](#other) + - [Interpretability](#interpretability) ## Data Preprocessing @@ -252,6 +257,14 @@ Tutorial: #### Pairwise (one-vs-one, all-vs-all) Approaches +### Multi-Labeled Classification + +> Difference between multi-class classification & multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related.. + +* [Multi-label classification - Wikipedia](https://en.wikipedia.org/wiki/Multi-label_classification) +* [Deep dive into multi-label classification..! (With detailed Case Study)](https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff) +* [1.12. Multiclass and multilabel algorithms — scikit-learn 0.21.3 documentation](https://scikit-learn.org/stable/modules/multiclass.html) + ## Model Validation ### Splitting Data @@ -313,68 +326,89 @@ This means that the training data will contain approximately 63.2% of the exampl ### Classification +Consider a **two-class** problem. +(Confusion matrix with different outcome labeled) + +| Actual \ Redicted | +1 | -1 | +| :---------------: | :-----------------: | :-----------------: | +| **+1** | True Positive (TP) | False Negative (FN) | +| **-1** | False Positive (FP) | True Negative (TN) | + #### Accuracy (Error Rate) * The error rate = the number of misclassified instances / the total number of instances tested. + * = (TP + TN) / (TP + FP + TN + FN) * Measuring errors this way hides how instances were misclssified. 
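To make the counts concrete, a small hedged sketch below recovers the table layout above with scikit-learn and applies the accuracy formula; the label vectors are invented purely for illustration.

```python
# Sketch: confusion-matrix counts and accuracy for a two-class problem;
# the labels below are made up for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, -1, 1, -1, -1, 1, -1, -1, -1])

# With labels ordered [+1, -1] the layout matches the table: [[TP, FN], [FP, TN]]
tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, -1]).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fn, fp, tn, accuracy)
```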
+Defect: + +* When the data is unbalnaced/skewed the accuracy may become invalid + * extreme case: when there are 99% of negative samples => predect all negative will get 99% accuracy + #### Confusion Matrix [**Wiki - Confusion Matrix**](https://en.wikipedia.org/wiki/Confusion_matrix) +[如何辨別機器學習模型的好壞?秒懂Confusion Matrix - YC Note](https://www.ycc.idv.tw/confusion-matrix.html) * With a confusion matrix you get a better understanding of the classification errors. * If the off-diagonal elements are all zero, then you have a perfect classifier * Construct a confusion matrix: a table about Actual labels vs. Predicted label -#### Precision, Recall Ratio +#### Precision, Recall Rate These metrics that are more useful than error rate when detection of one class is more important than another class. -Consider a **two-class** problem. -(Confusion matrix with different outcome labeled) - -| Actual \ Redicted | +1 | -1 | -| :---------------: | :-----------------: | :-----------------: | -| **+1** | True Positive (TP) | False Negative (FN) | -| **-1** | False Positive (FP) | True Negative (TN) | - * **Precision** = TP / (TP + FP) - * Tells us the fraction of records that were positive from the group that the classifier predicted to be positive - + * Tells us the fraction of records that were positive from the group that the classifier predicted to be positive * **Recall** = TP / (TP + FN) - * Measures the fraction of positive examples the classifier got right. - * Classifiers with a large recall dont have many positive examples classified incorectly. + * Measures the fraction of positive examples the classifier got right. + * Classifiers with a large recall dont have many positive examples classified incorectly. + +To improve precision, the classifier will predict a sample to be positive when "it has high confident", but this may miss many "not enough confident" positive sample, end up cause low recall rate. -* **F₁ Score** = 2 × (Precision × Recall) / (Precision + Recall) +To improve recall, the classifier will tend to look for the result which is not so popular, ... -Summary: +> Summary: +> +> * You can easily construct a classifier that achieves a high measure of recall or precision but not both. +> * If you predicted everything to be in the positive class, you'd have perfect recall but poor precision. -* You can easily construct a classifier that achieves a high measure of recall or precision but not both. -* If you predicted everything to be in the positive class, you'd have perfect recall but poor precision. +* **F1 Score** = 2 × (Precision × Recall) / (Precision + Recall) + * [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall Now consider **multiple classes** problem. * Macro-average * Micro-average +In **sorting problem**: Usually use *Top N* return result to calculate precision and recall rate to measure performance + +* Precision@N +* Recall@N + +#### Precision-Recall Curve (P-R Curve) + #### ROC curve [Wiki - Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) -ROC stands for Receiver Operating Characteristic +* x axis: False Positive Rate (FPR) = FP / (TN + FN) +* y axis: True Positive Rate (TPR) = TP / (TP + FP) + +> ROC stands for Receiver Operating Characteristic -* The ROC curve shows how the two rates chnge as the threshold changes +* The ROC curve shows how the two rates (FPR & TPR) changes as the threshold changes * The ROC curve has two lines, a solid one and a dashed one. 
- * The solid line: - * the leftmost point corresponds to classifying everything as the negative class. - * the rightmost point corresponds to classifying everything in the positive class. - * The dashed line: - * the curve you'd get by randomly guessing. + * The solid line: + * the leftmost point corresponds to classifying everything as the negative class. + * the rightmost point corresponds to classifying everything in the positive class. + * The dashed line: + * the curve you'd get by randomly guessing. * The ROC curve can be used to compare classifiers and make cost-versus-benefit decisions. - * Different classifiers may perform better for different threshold values + * Different classifiers may perform better for *different threshold values* * The best classifier would be in upper left as much as possible. - * This would mean that you had a high true positive rate for a low false positive rate. + * This would mean that you had a high true positive rate for a low false positive rate. **AUC** (Area Under the Curve): A metric to compare different ROC @@ -389,6 +423,20 @@ ROC stands for Receiver Operating Characteristic #### Root Mean Squared Error (RMSE) +$$ +\operatorname{RMSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}} +$$ + +* If outliers (e.g. some noise points) exist, will affect the result of RMSE and make it worse + +#### Mean Absolute Percent Error (MAPE) + +$$ +\operatorname{MAPE} = \sum_{i=1}^n |\frac{y_i - \hat{y}_i}{y_i}| \times \frac{100}{n} +$$ + +* equivalent to normalized the error of each data point => reduce the effect caused by the outliers + ### Clustering #### Within Groups Sum of Squares @@ -574,3 +622,12 @@ Find the fastest way to minimize the error. ### Multi-label Classification [Wiki - Multi-label Classification](https://en.wikipedia.org/wiki/Multi-label_classification) + +## Other + +### Interpretability + +> 可解釋性 + +* [要研究深度學習的可解釋性(Interpretability),應從哪幾個方面著手?](https://www.zhihu.com/question/320688440/answer/659692388) +* [可解釋性與deep learning的發展](https://zhuanlan.zhihu.com/p/30074544) diff --git a/Notes/MachineLearningExplainability.md b/Notes/MachineLearningExplainability.md new file mode 100644 index 0000000..f30e971 --- /dev/null +++ b/Notes/MachineLearningExplainability.md @@ -0,0 +1 @@ +[Machine Learning Explainability | Kaggle](https://www.kaggle.com/learn/machine-learning-explainability) \ No newline at end of file diff --git a/Notes/Math/Calculus/JacobiansDerivation.md b/Notes/Math/Calculus/JacobiansDerivation.md new file mode 100644 index 0000000..e2516fa --- /dev/null +++ b/Notes/Math/Calculus/JacobiansDerivation.md @@ -0,0 +1,11 @@ +# Some special case Jacobians Derivation in Machine Learning + +* Elementwise activation function $\mathbf{h} = f(\mathbf{z})$ + $$ + \frac{\partial\mathbf{h}}{\partial\mathbf{z}} + $$ +* wx+b + * x + * b +* uth + * u diff --git a/Notes/Math/Derivative/NYU_ML_HW1.md b/Notes/Math/Derivative/NYU_ML_HW1.md new file mode 100644 index 0000000..7463f65 --- /dev/null +++ b/Notes/Math/Derivative/NYU_ML_HW1.md @@ -0,0 +1,30 @@ +# NYU Machine Learning HW1 + +## 1 + +Let $\{x_1, x_2, \dots, x_n\}$ be a set of points in $d$-dimentional space. Suppose we wish to produce a single point estimate $\mu \in \mathcal{R}^d$ that minimizes the mean squared-error: + +$$ +\frac{1}{n} (||x_1 - \mu||^2_2 + ||x_2 - \mu||^2_2 + \dots + ||x_n - \mu||^2_2) +$$ + +Find a closed form expression for $\mu$ and prove that your answer is correct. 
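A sketch of one route to the closed form (the gradient argument, stated here only as a hint ahead of the worked solution below): setting the gradient of the objective with respect to $\mu$ to zero gives the sample mean.

$$
\nabla_\mu \frac{1}{n}\sum_{i=1}^n ||x_i - \mu||^2_2
= \frac{2}{n}\sum_{i=1}^n (\mu - x_i) = 0
\quad\Longrightarrow\quad
\mu = \frac{1}{n}\sum_{i=1}^n x_i
$$

Since the objective is a convex quadratic in $\mu$, this stationary point is the global minimizer.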
+ +### Solution + +> [Proof (part 1) minimizing squared error to regression line | Khan Academy - YouTube](https://www.youtube.com/watch?v=mIx2Oj5y9Q8) + +## 2 + +Not all norms behave the same; for instance, the $l_1$-norm of a vector can be dramatically different from the $l_2$-norm, especially in high dimensions. Prove the following norm inequalities for $d$-dimensional vectors, starting from the definitions provided in class and lecture notes. (Use any algebraic technique/result you like, as long as you cite it.) + +1. $||x||_\infty \leq ||x||_2 \leq \sqrt{d}||x||_\infty$ +2. $||x||_\infty \leq ||x||_1 \leq d||x||_\infty$ + +> [analysis - 1 and 2 norm inequality - Mathematics Stack Exchange](https://math.stackexchange.com/questions/2293778/1-and-2-norm-inequality) + +--- + +p-norm + +infinity norm diff --git a/Notes/Math/Derivative/homework.pdf b/Notes/Math/Derivative/homework.pdf new file mode 100644 index 0000000..321c92e Binary files /dev/null and b/Notes/Math/Derivative/homework.pdf differ diff --git a/Notes/Math/LinearAlgebra/Basis.md b/Notes/Math/LinearAlgebra/Basis.md new file mode 100644 index 0000000..a537b73 --- /dev/null +++ b/Notes/Math/LinearAlgebra/Basis.md @@ -0,0 +1,44 @@ +# Linear Algebra Basis + +## Definitions + +* Inner products: $\langle x, y \rangle = x^Ty$ +* Norm + * L2 Norm: $||x||^2 = x^Tx$ +* Distance: + * $d(x, y) = ||x - y||$ + * $d_M(x, y) = \sqrt{(x-y)^T M(x-y)}$ +* Orthogonality $\langle x, y \rangle = 0$ + +Needed to define norm preserving (i.e. orthogonal) transforms + +* Similarity between two vectors: $|\langle x, y \rangle| = |x||y|\cos\theta$ + * maximum when the two vector are aligned + * minimum when they are orthogonal to each other +* Transforming a vector + * $y = Tx$ + +## Subspaces and bases + +Let $S\in R^N$ be a subset of vectors in $R^N$ + +### Span + +### Bases + +## Eigenvectors and Eigenvalues + +... + +$$ +x^TLx = \sum (x_i - x_j)^2 = x^T(U\Lambda U^T)x \\ += (U^Tx)\Lambda(U^Tx) = \alpha^T \Lambda \alpha += \sum_{k=1}^N \lambda_k\alpha_k^2 +$$ + +> total variation + +## Resources + +* [**Immersive Math - Linear Algebra**](http://immersivemath.com/ila/index.html) +* [Essence of Linear Algebra - YouTube](https://www.youtube.com/playlist?list=PL_w8oSr1JpVCZ5pKXHKz6PkjGCbPbSBYv) diff --git a/Notes/Math/Probability/ProbabilityBasics.md b/Notes/Math/Probability/ProbabilityBasics.md new file mode 100644 index 0000000..e69de29 diff --git a/Notes/Math/Topic/DistanceSimilarityMeasurement.md b/Notes/Math/Topic/DistanceSimilarityMeasurement.md new file mode 100644 index 0000000..ce67429 --- /dev/null +++ b/Notes/Math/Topic/DistanceSimilarityMeasurement.md @@ -0,0 +1,65 @@ +# Distance / Similarity Measurement + +## Basis + +### Centroid + +## Distance + +### Euclidean Distance + +> 歐幾里德距離(歐氏距離) + +### Manhattan Distance + +> 曼哈頓距離 + +### Chebyshev Distance + +### Summary of Distance + +## Norm + +* [[數學分析] 淺談各種基本範數 (Norm)](https://ch-hsieh.blogspot.com/2010/04/norm.html) + +### Lp Norm + +> Lp範式 + +### Distance in Norm + +* Euclidean Distance: P = 2 +* Manhattan Distance: P = 1 +* Chebyshev Distance: P = ∞ + +## Similarity + +### Cosine Similarity + +### Exponential Similarity + +### Other Distance + +#### Kullback-Leibler Divergence (KL-divergence) + +> This is not a true metric because it does not obey triangle inequality + +### Summary of Similarity + +#### Similarity vs. 
Distance + +Cosine Similarity is inversely proportional with Euclidean Distance + +## Other Measurement + +### From a Point to a Set + +$$ +\operatorname{dis}(x, A) = \frac{1}{|A|} ............ (haven't finish yet) +$$ + +### Distance between two sets + +* Nearest distance between Set A and Set B +* Farthest distance between Set A and Set B +* Average distance between Set A and Set B diff --git a/Notes/Math/temp.txt b/Notes/Math/temp.txt new file mode 100644 index 0000000..f0d4192 --- /dev/null +++ b/Notes/Math/temp.txt @@ -0,0 +1,38 @@ +机器学习中的基本数学知识 +https://www.cnblogs.com/steven-yang/p/6348112.html + +element-wise product/point-wise product/Hadamard product + + + +distance +norm + +Distance + +Manhattan Distance (L1 Norm) +https://github.com/likejazz/Siamese-LSTM/blob/master/util.py#L133 +https://medium.com/@montjoile/l0-norm-l1-norm-l2-norm-l-infinity-norm-7a7d18a4f40c + +Python Packages + +numpy +scipy +pandas +matplotlib +seaborn [seaborn: statistical data visualization — seaborn 0.9.0 documentation](https://seaborn.pydata.org/) +sympy +cvxpy +cvxopt +networkx +[NetworkX documentation — NetworkX 1.10 documentation](https://networkx.github.io/documentation/networkx-1.10/index.html) + + +https://www.cis.upenn.edu/~jean/math-deep.pdf + + +[[數學分析] 什麼是若且唯若 "if and only if"](https://ch-hsieh.blogspot.com/2013/07/if-and-only-if.html) +[[數學分析] 淺談各種基本範數 (Norm)](https://ch-hsieh.blogspot.com/2010/04/norm.html) + + +p norm infinity norm \ No newline at end of file diff --git a/README.md b/README.md index 302601a..b728413 100644 --- a/README.md +++ b/README.md @@ -38,6 +38,11 @@ For evaluation * [`surprise`](https://github.com/NicolasHug/Surprise): A Python scikit building and analyzing recommender systems +For competition + +* [StackNet](https://github.com/kaz-Anova/StackNet): computational, scalable and analytical Meta modelling framework +* [hyperopt](https://github.com/hyperopt/hyperopt): Distributed Asynchronous Hyperparameter Optimization in Python + NLP related * [`gensim`](https://radimrehurek.com/gensim/index.html): Topic Modelling @@ -132,6 +137,8 @@ Iris Logistic|Logistic Regression / Classification|[Iris Data Set](https://archi * `Gradient Boosting Decision Tree (GBDT)` (aka. 
Multiple Additive Regression Tree (MART)) * [`XGBoost`](Algorithm/XGBoost/XGBoost.md) * [`LightGBM`](Algorithm/LightGBM/LightGBM.md) +* Stacking +* Blending ### NLP Related @@ -333,7 +340,9 @@ Iris Logistic|Logistic Regression / Classification|[Iris Data Set](https://archi #### MOOC * [Stanford Andrew Ng - CS229](http://cs229.stanford.edu/) - * [Coursera](https://www.coursera.org/learn/machine-learning) + * [Coursera](https://www.coursera.org/learn/machine-learning) +* [ML Course - Predrag Radivojac and Martha White](https://marthawhite.github.io/mlcourse/schedule.html) + * [Machine Learning Handbook](https://marthawhite.github.io/mlcourse/notes.pdf) ### Github @@ -376,7 +385,7 @@ Textbook Implementation * [AI Challenger Datasets](https://challenger.ai/datasets/) * [Peking University Open Research Data](http://opendata.pku.edu.cn/) * [Open Images Dataset](https://storage.googleapis.com/openimages/web/index.html) - * [github](https://github.com/openimages/dataset) + * [github](https://github.com/openimages/dataset) * [Alibaba Cloud Tianchi Data Lab](https://tianchi.aliyun.com/datalab/index.htm) * [biendata](https://biendata.com) @@ -391,6 +400,8 @@ Global * [CodaLab](https://competitions.codalab.org/competitions/21948) * [CodaLab](https://competitions.codalab.org/) +> * [DataSciCamp](https://www.datascicamp.com/) + Taiwan * [Open Data](https://opendata-contest.tca.org.tw/) @@ -398,6 +409,7 @@ Taiwan China * [Tianchi Competition](https://tianchi.aliyun.com/competition/) +* [FlyAI](https://www.flyai.com/) * [biendata](https://www.biendata.com/) * [SODA](http://soda.shdataic.org.cn/) * [Data Fountain](https://www.datafountain.cn/) diff --git a/reminder.txt b/reminder.txt new file mode 100644 index 0000000..02c95e0 --- /dev/null +++ b/reminder.txt @@ -0,0 +1,16 @@ +collecting the common concept over different algorithm to a single file with reference + +CRF + MEM + HMM together (and LR) maybe with some probability notes (Grpah Model summation***) Viterbi + +Similarity Measurement + KMeans + ML Big Picture clustering + +CRF + GraphicalModel + MLBigPicture + +Maybe partial derivative derivation on some Jacobians (which are mentioned in CS224n Lecture 3) + + + + + +1. Bayesian Network +2. Github Issue about CMiFM bug, check that out