diff --git a/Algorithm/CRF/CRF.md b/Algorithm/CRF/CRF.md index b9a38b0..447c3a7 100644 --- a/Algorithm/CRF/CRF.md +++ b/Algorithm/CRF/CRF.md @@ -1,7 +1,45 @@ -# Condiitonal Random Field +# Conditonal Random Field -## Probabilistic Undirected Graphical Model (aka. Markov Random Field) +> Can be considered to a extension of [MEM](../MEM/MEM.md) + +## Overview + +### Quick View + +| Category | Usage | Methematics | Application Field | +| ------------------- | -------------- | ----------- | ----------------- | +| Supervised Learning | Classification | Entropy | NLP | + +## Background - From MEM to CRF + +![](https://i.stack.imgur.com/khcnl.png) + +### Conditional Maximum Entropy Distribution + +## Concept + +### [Undirected Graph Model](../../Notes/GraphicalModel.md#Undirected-Graph-Model) + +## Viterbi Algorithm ## Links -[Wiki - Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field) +* [**An Introduction to Conditional Random Fields**](https://www.research.ed.ac.uk/portal/files/10482724/crftut_fnt.pdf) + +### Wikipedia + +* [Graphical model](https://en.wikipedia.org/wiki/Graphical_model) +* [Clique (graph theory)](https://en.wikipedia.org/wiki/Clique_(graph_theory)) +* [Markov random field](https://en.wikipedia.org/wiki/Markov_random_field) +* [Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field) + +### Tools + +* [kmkurn/pytorch-crf: (Linear-chain) Conditional random field in PyTorch.](https://github.com/kmkurn/pytorch-crf) + * [pytorch-crf — pytorch-crf 0.7.2 documentation](https://pytorch-crf.readthedocs.io/en/stable/) +* [CRF++](https://taku910.github.io/crfpp/) + * [github](https://github.com/taku910/crfpp) +* [TensorFlow CRF](https://www.tensorflow.org/api_docs/python/tf/contrib/crf) + * [github](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf) +* [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/) + * [github](https://github.com/TeamHG-Memex/sklearn-crfsuite/) diff --git a/Algorithm/EM/EM_Iris/EM_Iris_FromScratch.py b/Algorithm/EM/EM_Iris/EM_Iris_FromScratch.py new file mode 100644 index 0000000..e69de29 diff --git a/Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py b/Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py new file mode 100644 index 0000000..9fca1b5 --- /dev/null +++ b/Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py @@ -0,0 +1,2 @@ +from hmmlearn import hmm + diff --git a/Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py b/Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py new file mode 100644 index 0000000..d89f2dc --- /dev/null +++ b/Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py @@ -0,0 +1,8 @@ +import numpy as np + +def log_normalize(vector): + return np.log(vector) - np.log(np.sum(vector)) + +def log_sum(vector): + pass + \ No newline at end of file diff --git a/Algorithm/LogisticRegression/LogisticRegression.md b/Algorithm/LogisticRegression/LogisticRegression.md index 1093b2e..d58f30a 100644 --- a/Algorithm/LogisticRegression/LogisticRegression.md +++ b/Algorithm/LogisticRegression/LogisticRegression.md @@ -38,7 +38,7 @@ For each piece of data in the dataset: ## Multiple Classes -### [Multinomial](../MEM/MEM.md) - Softmax Regression (SMR) +### Multinomial - Softmax Regression (SMR) > Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive) 
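As a minimal illustration of the idea above, here is a hedged NumPy sketch of softmax-regression prediction; the weight matrix `W` and bias `b` are hypothetical untrained values used only to show the shapes involved, not anything taken from this repository.

```python
# Minimal sketch of softmax regression prediction (NumPy only).
# W and b are made-up, untrained parameters for illustration.
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 samples, 4 features
W = rng.normal(size=(4, 3))   # 3 mutually exclusive classes
b = np.zeros(3)

probs = softmax(X @ W + b)    # class probabilities, each row sums to 1
y_hat = probs.argmax(axis=1)  # predicted class per sample
```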
@@ -57,6 +57,10 @@ $$ ### Book +Dive into Deep Learning + +* [Ch3.4. Softmax Regression](http://d2l.ai/chapter_linear-networks/softmax-regression.html) + Machine Learning in Action * Ch5 Logistic Regression @@ -93,3 +97,5 @@ Multinomial (softmax) * [2 Ways to Implement Multinomial Logistic Regression in Python](http://dataaspirant.com/2017/05/15/implement-multinomial-logistic-regression-python/) - use scikit learn * [Machine Learning and Data Science: Multinomial (Multiclass) Logistic Regression](https://www.pugetsystems.com/labs/hpc/Machine-Learning-and-Data-Science-Multinomial-Multiclass-Logistic-Regression-1007/) +* [mlxtend - Softmax Regression](https://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/) + * [jupyter notebook](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/softmax-regression.ipynb) diff --git a/Algorithm/MEM/MEM.md b/Algorithm/MEM/MEM.md new file mode 100644 index 0000000..023fc41 --- /dev/null +++ b/Algorithm/MEM/MEM.md @@ -0,0 +1,85 @@ +# Maximum Entropy Model + +Maximum Entropy Classifier / [Multinomial Logistic Regression - i.e. Softmax](../LogisticRegression/LogisticRegression.md#Multinomial---Softmax-Regression-(SMR)), + +> Can be considered to a mother of other algorithms +> +> [Condiitonal Random Field](../CRF/CRF.md) + +## Brief Description + +### Quick View + +Category|Usage|Methematics|Application Field +--------|-----|-----------|----------------- +Supervised Learning|Classification|Entropy|Many + +## Concept + +### The MEM Model + +#### Background + +Consider a machine learning problem + +* $x$ = $(x_1, x_2, \dots, s_m)$ is input feature vector +* $y \in \{1, 2, \dots, k\}$ => a k classes classification problem + +Given k linear model for machine learning. Each has dimension of m. + +$$ +\phi = w_{i1}x_1 + w_{i2}x_2+\cdots + w_{im}x_m,~~~1\leq i \leq k +$$ + +Prediction "class" $\hat{y}$ is the maximum "score" for each linear model output. + +$$ +\hat{y} = \arg\max_{1\leq i \leq k} \phi_i(x) +$$ + +TBD + + + + + +### Training the Model + +* GIS Algorithm +* IIS Algorithm +* Gradient Descent +* [Quasi-Newton Method](https://en.wikipedia.org/wiki/Quasi-Newton_method) (擬牛頓法) - L-BFGS Algorithm + +#### GIS Algorithm + +> GIS stands for Generalized Iterative Scaling + +#### IIS Algorithm + +> IIS stands for Improved Iterative Scaling. Improved from [GIS](#GIS-Algorithm) + +### Solving Overfitting + +* Feature Select: throw out rare feature +* Feature Induction: pick useful feature (improves performance) +* Smoothing + +### Feature Selection + +### Feature Induction + +### Smoothing + +## Application + +> MEM is a classification model. It's not impossible to solve the sequential labeling problem, just not so suitable. +> For example of POS tagging, a classifier maybe not considered the global meaning information. 
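Since a maximum entropy classifier coincides with multinomial logistic regression, one hedged way to experiment with it is scikit-learn's `LogisticRegression` trained with the L-BFGS solver mentioned above; the iris data here is only a stand-in classification dataset, not a POS-tagging corpus.

```python
# Hedged sketch: a maximum entropy classifier fit as multinomial logistic
# regression with the L-BFGS solver; iris is just a placeholder dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=200)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # conditional distribution p(y | x) per sample
```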
+ +### POS Tagging + +## Resources + +### Wikipedia + +* [Principle of maximum entropy - Maximum entropy models](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#Maximum_entropy_models) +* [Multinomial logistic regression (Maximum entropy classifier)](https://en.wikipedia.org/wiki/Maximum_entropy_classifier) diff --git a/Algorithm/NaiveBayes/NaiveBayes.md b/Algorithm/NaiveBayes/NaiveBayes.md index e3397b5..cd6d232 100644 --- a/Algorithm/NaiveBayes/NaiveBayes.md +++ b/Algorithm/NaiveBayes/NaiveBayes.md @@ -2,7 +2,9 @@ ## Brief Description -Naive bayes are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. +Naive bayes are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of *conditional independence* between every pair of features given the value of the class variable. + +> But in the real word, in our concept, most of the things are not conditional independence. e.g. context in NLP ### Quick View @@ -21,6 +23,10 @@ Supervised Learning|Classification|Bayes' Theorem| ## Concept +### Bayes Decision + +Posterior prob. = (Likelihood * Prior prob.) / Evidence + ### Bayes' Theorem $$ @@ -35,10 +41,30 @@ $$ ### Real-world conditions -* We predict label by multiplying them. But if any of these probability is 0, then we will get 0 when we multiply them. To lessen the impact of this, we'll initialize all of our occurence counts to 1, and initialize the denominators to 2. (for binary classifier) -* Another problem is **Underflow**: doing too many multiplications of small numbers. (In programming, multiply many small numbers will eventually rounds off to 0) +* We predict label by multiplying them. But if any of these [probability is 0](#Zero-Probability-=>-Smoothing), then we will get 0 when we multiply them. To lessen the impact of this, we'll initialize all of our occurence counts to 1, and initialize the denominators to 2. (for binary classifier) +* Another problem is **Underflow**: doing too many multiplications of small numbers. (In programming, multiply many small numbers will eventually rounds off to 0 which called **floating-point underflow**) * Solution 1: Take the natural logarithm of this product +#### Zero Probability => Smoothing + +> original: $P(w_k|c_j) = \displaystyle\frac{n_k}{n}$ + +m-estimation: $P(w_k|c_j) = \displaystyle\frac{n_k + mp}{n + m}$ + +> additional m "virtual samples" distributed according to p + +## Application + +### Document Classification/Categorization + +Smoothing using Laplace smoothing (for $mp = 1$ and $m$ = Vocabulary) + +$$ +P(w_k|c_j) = \frac{n_k + 1}{n + |\operatorname{Vocabulary}|} +$$ + +### Word Sense Disambiguation + ## TODO * Figure out why the log mode in predictOne function has lower accuracy when using + than using * as the origin mode. 
([Line 66](NaiveBayes_Nursery/NaiveBayes_Nursery_sklearn.py)) @@ -55,5 +81,6 @@ $$ ## Wikipedia +* [Additive smoothing (Laplace smoothing)](https://en.wikipedia.org/wiki/Additive_smoothing) * [Bayesian Machine Learning](http://fastml.com/bayesian-machine-learning/) -* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) \ No newline at end of file +* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) diff --git a/Algorithm/PCA/PCA.md b/Algorithm/PCA/PCA.md index edc19a9..e22563f 100644 --- a/Algorithm/PCA/PCA.md +++ b/Algorithm/PCA/PCA.md @@ -10,9 +10,9 @@ A method for doing dimensionality reduction by transforming the feature space to ### Quick View -Category|Usage|Methematics|Application Field ---------|-----|-----------|----------------- -Unsupervised Learning|Dimensionality Reduction|Orthogonal, Covariance Matrix, Eigenvalue Analysis| +| Category | Usage | Methematics | Application Field | +| --------------------- | ------------------------ | ------------------------------------------------------------------------ | ----------------- | +| Unsupervised Learning | Dimensionality Reduction | Orthogonal, Covariance Matrix, Eigenvalue Analysis, Lagrange Multipliers | ## Concepts @@ -22,13 +22,13 @@ Steps * Take the first principal component to be in the direction of the largest variability of the data * The second preincipal component will be in the direction orthogonal to the first principal component -> (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix) + > (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix) * Once we have the eigenvectors of the covariance matrix, we can take the top N eigenvectors => N most important feature * Multiply the data by the top N eigenvectors to transform our data into the new space Pseudocode -``` +```txt Remove the mean Compute the covariance matrix Find the eigenvalues and eigenvectors of the covariance matrix @@ -42,11 +42,11 @@ Transform the data into the new space created by the top N eigenvectors Variables * m x n matrix: $X$ - * In practice, column vectors of $X$ are positively correlated - * the hypothetical factors that account for the score should be uncorrelated + * In practice, column vectors of $X$ are positively correlated + * the hypothetical factors that account for the score should be uncorrelated * orthogonal vectors: $\vec{y}_1, \vec{y}_2, \dots, \vec{y}_r$ - * We require that the vectors span $R(X)$ - * and hence the number of vectors, $r$, should be euqal to the rank of $X$ + * We require that the vectors span $R(X)$ + * and hence the number of vectors, $r$, should be euqal to the rank of $X$ The covariance matrix is $$ @@ -88,11 +88,11 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal. ## Reference * Linear Algebra with Applications - * Ch 5 Orthogonality - * Ch 6 Eigenvalues - * Ch 6.5 Application 4 - PCA - * Ch 7.5 Orthogonal Transformations - * Ch 7.6 The Eigenvalue Problem + * Ch 5 Orthogonality + * Ch 6 Eigenvalues + * Ch 6.5 Application 4 - PCA + * Ch 7.5 Orthogonal Transformations + * Ch 7.6 The Eigenvalue Problem ## Links @@ -101,7 +101,7 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal. 
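Before the links, here is a from-scratch sketch of the PCA pseudocode in the Concepts section above (NumPy only); `top_n` and the random data are illustrative choices, not part of the original notes.

```python
# From-scratch sketch of the PCA pseudocode above (NumPy only).
import numpy as np

def pca(data, top_n=2):
    mean = data.mean(axis=0)
    centered = data - mean                      # remove the mean
    cov = np.cov(centered, rowvar=False)        # covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)    # eigen-decomposition (symmetric matrix)
    order = np.argsort(eig_vals)[::-1][:top_n]  # sort, keep top N eigenvectors
    components = eig_vecs[:, order]
    return centered @ components, components    # data transformed into the new space

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced, components = pca(X, top_n=2)
print(X_reduced.shape)  # (100, 2)
```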
### Tutorial * [**Siraj Raval - Dimensionality Reduction**](https://www.youtube.com/watch?v=jPmV3j1dAv4) - * [Github](https://github.com/llSourcell/Dimensionality_Reduction) + * [Github](https://github.com/llSourcell/Dimensionality_Reduction) ### Scikit Learn diff --git a/Algorithm/SVM/SVM.md b/Algorithm/SVM/SVM.md index 5a879c1..9cfe542 100644 --- a/Algorithm/SVM/SVM.md +++ b/Algorithm/SVM/SVM.md @@ -10,9 +10,9 @@ Support Vector Machines (SVM) are learning systems that use a hypothesis space o ### Quick View -Category|Usage|Methematics|Application Field ---------|-----|-----------|----------------- -Supervised Learning|Classification (Main), Regression, Outliers Detection Clustering (Unsupervised)|Convex Optimization, Constrained Optimization, Lagrange Multipliers|Numerous +| Category | Usage | Methematics | Application Field | +| ------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------- | ----------------- | +| Supervised Learning | Classification (Main), Regression, Outliers Detection Clustering (Unsupervised) | Convex Optimization, Constrained Optimization, Lagrange Multipliers | Numerous | * Support Vector Machine is suited for extreme cases (little sample set) * SVM find a hyper-plane that separates its training data in such a way that the distance between the hyper plane and the cloest points form each class is maximized @@ -39,12 +39,12 @@ Disadvantage * Poor performance when features >> samples * SVMs do not provide probability estimates -SVM vs. Perceptron|SVM|Perceptron / NN -------------------|---|---------- -**Solving Problem**|Optimization|Iteration -**Optimal**|Global (∵ convex)|Local -**Non-linear Seprable**|Higher dimension|Stack multi-layer model -**Performance**|Better with prior knowledge|Skip feature engineering step +| SVM vs. Perceptron | SVM | Perceptron / NN | +| ----------------------- | --------------------------- | ----------------------------- | +| **Solving Problem** | Optimization | Iteration | +| **Optimal** | Global (∵ convex) | Local | +| **Non-linear Seprable** | Higher dimension | Stack multi-layer model | +| **Performance** | Better with prior knowledge | Skip feature engineering step | ## Terminology @@ -110,22 +110,30 @@ SVM vs. Perceptron|SVM|Perceptron / NN ### Kernal Function * Use a kernal trick to reduce the computational cost - * Kernel Function: Transform a non-linear space into a linear space - * Popular kernel types - * Linear Kernel + * Kernel Function: Transform a non-linear space into a linear space + * Popular kernel types + * Linear Kernel - $K(x, y) = x \times y$ + $\kappa(x_i, x_j) = x_i^Ty_j$ - * Polynomial Kernel + > $K(x, y) = x \times y$ + + * Polynomial Kernel + + $\kappa(x_i, x_j) = (x_i^Ty_j)^d$ + + When $d = 0$ then it is linear kernel + + > $K(x, y) = (x \times y + 1)^d$, $d \geq 0$ - $K(x, y) = (x \times y + 1)^d$ + * Radial Basis Function (RBF) Kernel - * Radial Basis Function (RBF) Kernel + $\kappa(x_i, x_j) = \exp(-\frac{||x_i - x_j||^2}{2\sigma^2})$, $\sigma > 0$ and is width of RBF kernel - $K(x, y) = e^{-\gamma ||x-y||^2}$ + > $K(x, y) = e^{-\gamma ||x-y||^2}$ - * Sigmoid Kernel - * ... + * Sigmoid Kernel + * ... 
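A hedged sketch of the kernel formulas listed above as plain NumPy functions; the `gamma` and `d` values are illustrative defaults, not tuned settings.

```python
# Sketch of the kernel functions above (NumPy only); parameters are illustrative.
import numpy as np

def linear_kernel(x, y):
    return x @ y                                   # K(x, y) = x . y

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1) ** d                        # K(x, y) = (x . y + 1)^d, d >= 0

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # K(x, y) = exp(-gamma ||x - y||^2)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```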
### Tune Parameter diff --git a/Algorithm/SVM/SVMDerivation.md b/Algorithm/SVM/SVMDerivation.md index 3c14f90..0b08640 100644 --- a/Algorithm/SVM/SVMDerivation.md +++ b/Algorithm/SVM/SVMDerivation.md @@ -16,7 +16,7 @@ ### Theorem * Lagrange Duality - * Karush-Kuhn-Tucker (KKT) + * Karush-Kuhn-Tucker (KKT) ### Big Picture @@ -51,6 +51,8 @@ $$|\vec{w} \cdot \vec{x} + b|$$ We can represent our correctness of classification by: $$y_i (\vec{w} \cdot \vec{x} + b)$$ +> $$\begin{cases} \vec{w}^T\vec{x} + b \geq +1, & y_i = +1 \\ \vec{w}^T\vec{x} + b \leq -1, & y_i = -1\end{cases}$$ + * If classify correctly => $y_i$ and $(\vec{w} \cdot \vec{x} + b)$ will have same sign => Positive product * Else => Negative product @@ -72,7 +74,7 @@ $$ \gamma_i = y_i \bigg(\frac{\vec{w}}{||\vec{w}||} \cdot \vec{x_i} + \frac{b}{| $$ r = \frac{\hat{r}}{||\vec{w}||} $$ -when ||w|| = 1 <=> functional margin = geometric margin +when $||\vec{w}||$ = 1 <=> functional margin = geometric margin ### Maximize Margin @@ -146,10 +148,12 @@ Distance between two hyperplane are called **margin**. Margin depends on normal Apply *Lagrange Duality*, get the optimal solution of primal problem by solving dual problem Advantage + 1. Dual problem is easier to solve + * remove the constrain 2. Introduce kernel function, expend to non-linear classification problem -For each constraint introduce a **Lagrange multiplier** $\alpha_i \geq 0$ +For each constraint introduce a [**Lagrange multiplier**](https://en.wikipedia.org/wiki/Lagrange_multiplier) $\alpha_i \geq 0$ Lagrange function $$ @@ -220,9 +224,9 @@ It takes the large optimization problem and breaks it into many small problem. 1. Once we have a set of alphas we can easily compute our weights w 2. And get the separating hyperplane * SMO algorithm choose two alphas to optimize on each cycle. Once a sutiable pair of alphas is found, one is increased and one is decreased. - * Suitable criteria - * A pair mus meet is that both of the alphas have to be outside their margin boundary - * The alphas aren't already clamped or bounded + * Suitable criteria + * A pair mus meet is that both of the alphas have to be outside their margin boundary + * The alphas aren't already clamped or bounded * The reason that we have to change two alphas at the same time is because we need have a constraint $\displaystyle \sum \alpha_i * y^{(i)} = 0$ ## Introduce slack variable (C parameter) @@ -232,4 +236,5 @@ It takes the large optimization problem and breaks it into many small problem. 
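As an illustrative sketch of how the slack penalty C trades margin width against training errors, the snippet below uses scikit-learn's `SVC`; the blob dataset and the particular C values are arbitrary choices for demonstration, not a recommendation.

```python
# Hedged sketch: effect of the slack penalty C on a soft-margin SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide margin, more slack allowed (more support vectors);
    # large C -> narrow margin, fewer violations tolerated.
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}")
```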
## Reference * 李航 - 統計機器學習 -* Machine Learning in Action \ No newline at end of file +* Machine Learning in Action +* 機器學習 diff --git a/Algorithm/VSM/VSM_Document_Similarity/VSM_Document_Similarity_FromScratch.py b/Algorithm/VSM/VSM_Document_Similarity/VSM_Document_Similarity_FromScratch.py new file mode 100644 index 0000000..9602589 --- /dev/null +++ b/Algorithm/VSM/VSM_Document_Similarity/VSM_Document_Similarity_FromScratch.py @@ -0,0 +1,89 @@ +## VSM Document Similarity From Scratch Version +# +# Author: David Lee +# Create Date: 2018/12/2 +# +# Article amount: 3443 + +import numpy as np +import pandas as pd +from collections import defaultdict + +class Similarity: + def euclidianDistanceSimilarity(self, A, B): + return 1.0/(1.0 + np.linalg.norm(A - B)) + def cosineSimilarity(self, A, B): + num = float(A.T*B) + denom = np.linalg.norm(A) * np.linalg.norm(B) + return 0.5 + 0.5 * (num/denom) + +class VSM_Model: + def __init__(self): + pass + + def __toDictionary(self, articleMatrix): + """ + For each article, calculate a dictionary + i.e. word id pair + """ + for article in articleMatrix: + pass + + def __DictToCorpus(self, dictionary): + pass + + def calcTfIdf(self): + pass + + def getSimilarityMat(self, articleMatrix): + pass + + +def documentTokenize(document): + frequencyOfWords = defaultdict(int) + for line in document: + tokens = line.strip().split() + if tokens: # skip empty lines + # Extract single word + for token in tokens[1:]: + word = token[:token.index('/')] + pos = token[token.index('/')+1:] # part-of-speech + # remove common words and meaningless words (using a stoplist) + if pos not in ('w', 'y', 'u', 'c', 'p', 'k', 'm'): + frequencyOfWords[word] += 1 + return frequencyOfWords + +def articlesToMatrix(document, freqOfWords): + articleMatrix = [] + emptyline = 0 # To seperate each article + wordsWeWant = [] + articleCounter = 0 + for line in document: + tokens = line.strip().split() + if tokens: + emptyline = 0 + for token in tokens: + word = token[:token.index('/')] + # remove the words that only appear once in the corpus + if freqOfWords[word] > 1: + wordsWeWant.append(word) + else: + emptyline += 1 + # Found an article + if emptyline == 3: + articleMatrix.append(wordsWeWant) + wordsWeWant = [] + articleCounter += 1 + if wordsWeWant: + articleMatrix.append(wordsWeWant) + articleCounter += 1 + print('Total articles:', articleCounter) + return articleMatrix + +def documentPreprocessing(path): + with open(path, 'r') as chinaNews: + document = chinaNews.readlines() + # 1. tokenize the documents + frequencyOfWords = documentTokenize(document) + # 2. calculate remain word for each article + return articlesToMatrix(document, frequencyOfWords) \ No newline at end of file diff --git a/Notes/FeatureEngineering.md b/Notes/FeatureEngineering.md index c895921..b9b1650 100644 --- a/Notes/FeatureEngineering.md +++ b/Notes/FeatureEngineering.md @@ -380,6 +380,9 @@ The features themselves will work for either model. However, numerical inputs to ## Resources +* [**分分鐘帶你殺入Kaggle Top 1% - 知乎**](https://zhuanlan.zhihu.com/p/27424282) +* [如何ensemble多個神經網路? 
- 知乎](https://www.zhihu.com/question/60753512/answer/184409655) +* [**Kaggle Ensembling Guide | MLWave**](https://mlwave.com/kaggle-ensembling-guide/) * [Wiki - Feature engineering](https://en.wikipedia.org/wiki/Feature_engineering) * [機器學習筆記 - 特徵工程](https://feisky.xyz/machine-learning/basic/feature-engineering.html) * [**Understanding Feature Engineering (Part 1) — Continuous Numeric Data**](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b) diff --git a/Notes/GraphicalModel.md b/Notes/GraphicalModel.md new file mode 100644 index 0000000..b67844a --- /dev/null +++ b/Notes/GraphicalModel.md @@ -0,0 +1,38 @@ +# (Probabilistic) Graphical Model + +## Overview + +* Graph Models represent families of probability distribution via graphs + * [Directed GM](#directed-graph-model): [Bayesian Network](#dag-bayesian-network) + * Consider local information + * [Undirected GM](#undirected-graph-model): [Markov Random Field](#probabilistic-undirected-graph-model-aka-markov-random-field) + * Consider global information + * Combination GM: Chain graphs + +* Typical Graphical Model + * [Generative Model](MachineLearningBigPicture.md#Generative-Model): from Category to Data + * e.g. Naive Bayes + * [Discriminative Model](MachineLearningBigPicture.md#Discriminative-Model): from Data to Category + +## Directed Graph Model + +> [HMM](../HMM/HMM.md) is a Directed Graph Model + +### DAG (Bayesian Network) + +## Undirected Graph Model + +> [CRF is a Undirected Graph Model](#Probabilistic-Undirected-Graph-Model-(aka.-Markov-Random-Field)) + +Undirected Graph Model Concept: + +* Clique (團): the fully connected subgraph (subset of nodes where each pair connected) +* Maximal clique (最大團): a clique that cannot be extended by including one more adjacent vertex (no extra node can be added and remain a clique) + +Score Problem + +* Potential Function (勢函數): map clique to a positive real number (score) + +### Probabilistic Undirected Graph Model (aka. 
Markov Random Field) + +> CRF = Markov Random Field + conditions; C => condition distribution, RF => Markov Random Field (joint distribution) diff --git a/Notes/MachineLearningConcepts.md b/Notes/MachineLearningConcepts.md index b9d5ddb..30f70f5 100644 --- a/Notes/MachineLearningConcepts.md +++ b/Notes/MachineLearningConcepts.md @@ -24,6 +24,7 @@ Table of content - [Binary to Multi-class](#binary-to-multi-class) - [One-vs-rest (one-vs-all) Approaches](#one-vs-rest-one-vs-all-approaches) - [Pairwise (one-vs-one, all-vs-all) Approaches](#pairwise-one-vs-one-all-vs-all-approaches) + - [Multi-Labeled Classification](#multi-labeled-classification) - [Model Validation](#model-validation) - [Splitting Data](#splitting-data) - [Simplest Split](#simplest-split) @@ -38,12 +39,14 @@ Table of content - [Classification](#classification) - [Accuracy (Error Rate)](#accuracy-error-rate) - [Confusion Matrix](#confusion-matrix) - - [Precision, Recall Ratio](#precision-recall-ratio) + - [Precision, Recall Rate](#precision-recall-rate) + - [Precision-Recall Curve (P-R Curve)](#precision-recall-curve-p-r-curve) - [ROC curve](#roc-curve) - [Regression](#regression) - [Mean Absolute Error (MAE)](#mean-absolute-error-mae) - [Mean Squared Error (MSE)](#mean-squared-error-mse) - [Root Mean Squared Error (RMSE)](#root-mean-squared-error-rmse) + - [Mean Absolute Percent Error (MAPE)](#mean-absolute-percent-error-mape) - [Clustering](#clustering) - [Within Groups Sum of Squares](#within-groups-sum-of-squares) - [Mean Silhouette Coefficient of all samples](#mean-silhouette-coefficient-of-all-samples) @@ -70,6 +73,8 @@ Table of content - [Incremental Learning (Online Learning)](#incremental-learning-online-learning) - [Competitive Learning](#competitive-learning) - [Multi-label Classification](#multi-label-classification) + - [Other](#other) + - [Interpretability](#interpretability) ## Data Preprocessing @@ -252,6 +257,14 @@ Tutorial: #### Pairwise (one-vs-one, all-vs-all) Approaches +### Multi-Labeled Classification + +> Difference between multi-class classification & multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related.. + +* [Multi-label classification - Wikipedia](https://en.wikipedia.org/wiki/Multi-label_classification) +* [Deep dive into multi-label classification..! (With detailed Case Study)](https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff) +* [1.12. Multiclass and multilabel algorithms — scikit-learn 0.21.3 documentation](https://scikit-learn.org/stable/modules/multiclass.html) + ## Model Validation ### Splitting Data @@ -313,68 +326,89 @@ This means that the training data will contain approximately 63.2% of the exampl ### Classification +Consider a **two-class** problem. +(Confusion matrix with different outcome labeled) + +| Actual \ Redicted | +1 | -1 | +| :---------------: | :-----------------: | :-----------------: | +| **+1** | True Positive (TP) | False Negative (FN) | +| **-1** | False Positive (FP) | True Negative (TN) | + #### Accuracy (Error Rate) * The error rate = the number of misclassified instances / the total number of instances tested. + * = (TP + TN) / (TP + FP + TN + FN) * Measuring errors this way hides how instances were misclssified. 
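To make the counts concrete, a small hedged sketch below recovers the table layout above with scikit-learn and applies the accuracy formula; the label vectors are invented purely for illustration.

```python
# Sketch: confusion-matrix counts and accuracy for a two-class problem;
# the labels below are made up for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, -1, 1, -1, -1, 1, -1, -1, -1])

# With labels ordered [+1, -1] the layout matches the table: [[TP, FN], [FP, TN]]
tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, -1]).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fn, fp, tn, accuracy)
```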
+Defect: + +* When the data is unbalnaced/skewed the accuracy may become invalid + * extreme case: when there are 99% of negative samples => predect all negative will get 99% accuracy + #### Confusion Matrix [**Wiki - Confusion Matrix**](https://en.wikipedia.org/wiki/Confusion_matrix) +[如何辨別機器學習模型的好壞?秒懂Confusion Matrix - YC Note](https://www.ycc.idv.tw/confusion-matrix.html) * With a confusion matrix you get a better understanding of the classification errors. * If the off-diagonal elements are all zero, then you have a perfect classifier * Construct a confusion matrix: a table about Actual labels vs. Predicted label -#### Precision, Recall Ratio +#### Precision, Recall Rate These metrics that are more useful than error rate when detection of one class is more important than another class. -Consider a **two-class** problem. -(Confusion matrix with different outcome labeled) - -| Actual \ Redicted | +1 | -1 | -| :---------------: | :-----------------: | :-----------------: | -| **+1** | True Positive (TP) | False Negative (FN) | -| **-1** | False Positive (FP) | True Negative (TN) | - * **Precision** = TP / (TP + FP) - * Tells us the fraction of records that were positive from the group that the classifier predicted to be positive - + * Tells us the fraction of records that were positive from the group that the classifier predicted to be positive * **Recall** = TP / (TP + FN) - * Measures the fraction of positive examples the classifier got right. - * Classifiers with a large recall dont have many positive examples classified incorectly. + * Measures the fraction of positive examples the classifier got right. + * Classifiers with a large recall dont have many positive examples classified incorectly. + +To improve precision, the classifier will predict a sample to be positive when "it has high confident", but this may miss many "not enough confident" positive sample, end up cause low recall rate. -* **F₁ Score** = 2 × (Precision × Recall) / (Precision + Recall) +To improve recall, the classifier will tend to look for the result which is not so popular, ... -Summary: +> Summary: +> +> * You can easily construct a classifier that achieves a high measure of recall or precision but not both. +> * If you predicted everything to be in the positive class, you'd have perfect recall but poor precision. -* You can easily construct a classifier that achieves a high measure of recall or precision but not both. -* If you predicted everything to be in the positive class, you'd have perfect recall but poor precision. +* **F1 Score** = 2 × (Precision × Recall) / (Precision + Recall) + * [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall Now consider **multiple classes** problem. * Macro-average * Micro-average +In **sorting problem**: Usually use *Top N* return result to calculate precision and recall rate to measure performance + +* Precision@N +* Recall@N + +#### Precision-Recall Curve (P-R Curve) + #### ROC curve [Wiki - Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) -ROC stands for Receiver Operating Characteristic +* x axis: False Positive Rate (FPR) = FP / (TN + FN) +* y axis: True Positive Rate (TPR) = TP / (TP + FP) + +> ROC stands for Receiver Operating Characteristic -* The ROC curve shows how the two rates chnge as the threshold changes +* The ROC curve shows how the two rates (FPR & TPR) changes as the threshold changes * The ROC curve has two lines, a solid one and a dashed one. 
- * The solid line: - * the leftmost point corresponds to classifying everything as the negative class. - * the rightmost point corresponds to classifying everything in the positive class. - * The dashed line: - * the curve you'd get by randomly guessing. + * The solid line: + * the leftmost point corresponds to classifying everything as the negative class. + * the rightmost point corresponds to classifying everything in the positive class. + * The dashed line: + * the curve you'd get by randomly guessing. * The ROC curve can be used to compare classifiers and make cost-versus-benefit decisions. - * Different classifiers may perform better for different threshold values + * Different classifiers may perform better for *different threshold values* * The best classifier would be in upper left as much as possible. - * This would mean that you had a high true positive rate for a low false positive rate. + * This would mean that you had a high true positive rate for a low false positive rate. **AUC** (Area Under the Curve): A metric to compare different ROC @@ -389,6 +423,20 @@ ROC stands for Receiver Operating Characteristic #### Root Mean Squared Error (RMSE) +$$ +\operatorname{RMSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}} +$$ + +* If outliers (e.g. some noise points) exist, will affect the result of RMSE and make it worse + +#### Mean Absolute Percent Error (MAPE) + +$$ +\operatorname{MAPE} = \sum_{i=1}^n |\frac{y_i - \hat{y}_i}{y_i}| \times \frac{100}{n} +$$ + +* equivalent to normalized the error of each data point => reduce the effect caused by the outliers + ### Clustering #### Within Groups Sum of Squares @@ -574,3 +622,12 @@ Find the fastest way to minimize the error. ### Multi-label Classification [Wiki - Multi-label Classification](https://en.wikipedia.org/wiki/Multi-label_classification) + +## Other + +### Interpretability + +> 可解釋性 + +* [要研究深度學習的可解釋性(Interpretability),應從哪幾個方面著手?](https://www.zhihu.com/question/320688440/answer/659692388) +* [可解釋性與deep learning的發展](https://zhuanlan.zhihu.com/p/30074544) diff --git a/Notes/MachineLearningExplainability.md b/Notes/MachineLearningExplainability.md new file mode 100644 index 0000000..f30e971 --- /dev/null +++ b/Notes/MachineLearningExplainability.md @@ -0,0 +1 @@ +[Machine Learning Explainability | Kaggle](https://www.kaggle.com/learn/machine-learning-explainability) \ No newline at end of file diff --git a/Notes/Math/Calculus/JacobiansDerivation.md b/Notes/Math/Calculus/JacobiansDerivation.md new file mode 100644 index 0000000..e2516fa --- /dev/null +++ b/Notes/Math/Calculus/JacobiansDerivation.md @@ -0,0 +1,11 @@ +# Some special case Jacobians Derivation in Machine Learning + +* Elementwise activation function $\mathbf{h} = f(\mathbf{z})$ + $$ + \frac{\partial\mathbf{h}}{\partial\mathbf{z}} + $$ +* wx+b + * x + * b +* uth + * u diff --git a/Notes/Math/Derivative/NYU_ML_HW1.md b/Notes/Math/Derivative/NYU_ML_HW1.md new file mode 100644 index 0000000..7463f65 --- /dev/null +++ b/Notes/Math/Derivative/NYU_ML_HW1.md @@ -0,0 +1,30 @@ +# NYU Machine Learning HW1 + +## 1 + +Let $\{x_1, x_2, \dots, x_n\}$ be a set of points in $d$-dimentional space. Suppose we wish to produce a single point estimate $\mu \in \mathcal{R}^d$ that minimizes the mean squared-error: + +$$ +\frac{1}{n} (||x_1 - \mu||^2_2 + ||x_2 - \mu||^2_2 + \dots + ||x_n - \mu||^2_2) +$$ + +Find a closed form expression for $\mu$ and prove that your answer is correct. 
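A sketch of one route to the closed form (the gradient argument, stated here only as a hint ahead of the worked solution below): setting the gradient of the objective with respect to $\mu$ to zero gives the sample mean.

$$
\nabla_\mu \frac{1}{n}\sum_{i=1}^n ||x_i - \mu||^2_2
= \frac{2}{n}\sum_{i=1}^n (\mu - x_i) = 0
\quad\Longrightarrow\quad
\mu = \frac{1}{n}\sum_{i=1}^n x_i
$$

Since the objective is a convex quadratic in $\mu$, this stationary point is the global minimizer.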
+ +### Solution + +> [Proof (part 1) minimizing squared error to regression line | Khan Academy - YouTube](https://www.youtube.com/watch?v=mIx2Oj5y9Q8) + +## 2 + +Not all norms behave the same; for instance, the $l_1$-norm of a vector can be dramatically different from the $l_2$-norm, especially in high dimensions. Prove the following norm inequalities for $d$-dimensional vectors, starting from the definitions provided in class and lecture notes. (Use any algebraic technique/result you like, as long as you cite it.) + +1. $||x||_\infty \leq ||x||_2 \leq \sqrt{d}||x||_\infty$ +2. $||x||_\infty \leq ||x||_1 \leq d||x||_\infty$ + +> [analysis - 1 and 2 norm inequality - Mathematics Stack Exchange](https://math.stackexchange.com/questions/2293778/1-and-2-norm-inequality) + +--- + +p-norm + +infinity norm diff --git a/Notes/Math/Derivative/homework.pdf b/Notes/Math/Derivative/homework.pdf new file mode 100644 index 0000000..321c92e Binary files /dev/null and b/Notes/Math/Derivative/homework.pdf differ diff --git a/Notes/Math/LinearAlgebra/Basis.md b/Notes/Math/LinearAlgebra/Basis.md new file mode 100644 index 0000000..a537b73 --- /dev/null +++ b/Notes/Math/LinearAlgebra/Basis.md @@ -0,0 +1,44 @@ +# Linear Algebra Basis + +## Definitions + +* Inner products: $\langle x, y \rangle = x^Ty$ +* Norm + * L2 Norm: $||x||^2 = x^Tx$ +* Distance: + * $d(x, y) = ||x - y||$ + * $d_M(x, y) = \sqrt{(x-y)^T M(x-y)}$ +* Orthogonality $\langle x, y \rangle = 0$ + +Needed to define norm preserving (i.e. orthogonal) transforms + +* Similarity between two vectors: $|\langle x, y \rangle| = |x||y|\cos\theta$ + * maximum when the two vector are aligned + * minimum when they are orthogonal to each other +* Transforming a vector + * $y = Tx$ + +## Subspaces and bases + +Let $S\in R^N$ be a subset of vectors in $R^N$ + +### Span + +### Bases + +## Eigenvectors and Eigenvalues + +... + +$$ +x^TLx = \sum (x_i - x_j)^2 = x^T(U\Lambda U^T)x \\ += (U^Tx)\Lambda(U^Tx) = \alpha^T \Lambda \alpha += \sum_{k=1}^N \lambda_k\alpha_k^2 +$$ + +> total variation + +## Resources + +* [**Immersive Math - Linear Algebra**](http://immersivemath.com/ila/index.html) +* [Essence of Linear Algebra - YouTube](https://www.youtube.com/playlist?list=PL_w8oSr1JpVCZ5pKXHKz6PkjGCbPbSBYv) diff --git a/Notes/Math/Probability/ProbabilityBasics.md b/Notes/Math/Probability/ProbabilityBasics.md new file mode 100644 index 0000000..e69de29 diff --git a/Notes/Math/Topic/DistanceSimilarityMeasurement.md b/Notes/Math/Topic/DistanceSimilarityMeasurement.md new file mode 100644 index 0000000..ce67429 --- /dev/null +++ b/Notes/Math/Topic/DistanceSimilarityMeasurement.md @@ -0,0 +1,65 @@ +# Distance / Similarity Measurement + +## Basis + +### Centroid + +## Distance + +### Euclidean Distance + +> 歐幾里德距離(歐氏距離) + +### Manhattan Distance + +> 曼哈頓距離 + +### Chebyshev Distance + +### Summary of Distance + +## Norm + +* [[數學分析] 淺談各種基本範數 (Norm)](https://ch-hsieh.blogspot.com/2010/04/norm.html) + +### Lp Norm + +> Lp範式 + +### Distance in Norm + +* Euclidean Distance: P = 2 +* Manhattan Distance: P = 1 +* Chebyshev Distance: P = ∞ + +## Similarity + +### Cosine Similarity + +### Exponential Similarity + +### Other Distance + +#### Kullback-Leibler Divergence (KL-divergence) + +> This is not a true metric because it does not obey triangle inequality + +### Summary of Similarity + +#### Similarity vs. 
Distance + +Cosine Similarity is inversely proportional with Euclidean Distance + +## Other Measurement + +### From a Point to a Set + +$$ +\operatorname{dis}(x, A) = \frac{1}{|A|} ............ (haven't finish yet) +$$ + +### Distance between two sets + +* Nearest distance between Set A and Set B +* Farthest distance between Set A and Set B +* Average distance between Set A and Set B diff --git a/Notes/Math/temp.txt b/Notes/Math/temp.txt new file mode 100644 index 0000000..f0d4192 --- /dev/null +++ b/Notes/Math/temp.txt @@ -0,0 +1,38 @@ +机器学习中的基本数学知识 +https://www.cnblogs.com/steven-yang/p/6348112.html + +element-wise product/point-wise product/Hadamard product + + + +distance +norm + +Distance + +Manhattan Distance (L1 Norm) +https://github.com/likejazz/Siamese-LSTM/blob/master/util.py#L133 +https://medium.com/@montjoile/l0-norm-l1-norm-l2-norm-l-infinity-norm-7a7d18a4f40c + +Python Packages + +numpy +scipy +pandas +matplotlib +seaborn [seaborn: statistical data visualization — seaborn 0.9.0 documentation](https://seaborn.pydata.org/) +sympy +cvxpy +cvxopt +networkx +[NetworkX documentation — NetworkX 1.10 documentation](https://networkx.github.io/documentation/networkx-1.10/index.html) + + +https://www.cis.upenn.edu/~jean/math-deep.pdf + + +[[數學分析] 什麼是若且唯若 "if and only if"](https://ch-hsieh.blogspot.com/2013/07/if-and-only-if.html) +[[數學分析] 淺談各種基本範數 (Norm)](https://ch-hsieh.blogspot.com/2010/04/norm.html) + + +p norm infinity norm \ No newline at end of file diff --git a/README.md b/README.md index 302601a..b728413 100644 --- a/README.md +++ b/README.md @@ -38,6 +38,11 @@ For evaluation * [`surprise`](https://github.com/NicolasHug/Surprise): A Python scikit building and analyzing recommender systems +For competition + +* [StackNet](https://github.com/kaz-Anova/StackNet): computational, scalable and analytical Meta modelling framework +* [hyperopt](https://github.com/hyperopt/hyperopt): Distributed Asynchronous Hyperparameter Optimization in Python + NLP related * [`gensim`](https://radimrehurek.com/gensim/index.html): Topic Modelling @@ -132,6 +137,8 @@ Iris Logistic|Logistic Regression / Classification|[Iris Data Set](https://archi * `Gradient Boosting Decision Tree (GBDT)` (aka. 
Multiple Additive Regression Tree (MART)) * [`XGBoost`](Algorithm/XGBoost/XGBoost.md) * [`LightGBM`](Algorithm/LightGBM/LightGBM.md) +* Stacking +* Blending ### NLP Related @@ -333,7 +340,9 @@ Iris Logistic|Logistic Regression / Classification|[Iris Data Set](https://archi #### MOOC * [Stanford Andrew Ng - CS229](http://cs229.stanford.edu/) - * [Coursera](https://www.coursera.org/learn/machine-learning) + * [Coursera](https://www.coursera.org/learn/machine-learning) +* [ML Course - Predrag Radivojac and Martha White](https://marthawhite.github.io/mlcourse/schedule.html) + * [Machine Learning Handbook](https://marthawhite.github.io/mlcourse/notes.pdf) ### Github @@ -376,7 +385,7 @@ Textbook Implementation * [AI Challenger Datasets](https://challenger.ai/datasets/) * [Peking University Open Research Data](http://opendata.pku.edu.cn/) * [Open Images Dataset](https://storage.googleapis.com/openimages/web/index.html) - * [github](https://github.com/openimages/dataset) + * [github](https://github.com/openimages/dataset) * [Alibaba Cloud Tianchi Data Lab](https://tianchi.aliyun.com/datalab/index.htm) * [biendata](https://biendata.com) @@ -391,6 +400,8 @@ Global * [CodaLab](https://competitions.codalab.org/competitions/21948) * [CodaLab](https://competitions.codalab.org/) +> * [DataSciCamp](https://www.datascicamp.com/) + Taiwan * [Open Data](https://opendata-contest.tca.org.tw/) @@ -398,6 +409,7 @@ Taiwan China * [Tianchi Competition](https://tianchi.aliyun.com/competition/) +* [FlyAI](https://www.flyai.com/) * [biendata](https://www.biendata.com/) * [SODA](http://soda.shdataic.org.cn/) * [Data Fountain](https://www.datafountain.cn/) diff --git a/reminder.txt b/reminder.txt new file mode 100644 index 0000000..02c95e0 --- /dev/null +++ b/reminder.txt @@ -0,0 +1,16 @@ +collecting the common concept over different algorithm to a single file with reference + +CRF + MEM + HMM together (and LR) maybe with some probability notes (Grpah Model summation***) Viterbi + +Similarity Measurement + KMeans + ML Big Picture clustering + +CRF + GraphicalModel + MLBigPicture + +Maybe partial derivative derivation on some Jacobians (which are mentioned in CS224n Lecture 3) + + + + + +1. Bayesian Network +2. Github Issue about CMiFM bug, check that out