diff --git a/10_Trials.png b/10_Trials.png
new file mode 100644
index 0000000..1285766
Binary files /dev/null and b/10_Trials.png differ
diff --git a/5000_Trials.png b/5000_Trials.png
new file mode 100644
index 0000000..a9db115
Binary files /dev/null and b/5000_Trials.png differ
diff --git a/Questions.txt b/Questions.txt
new file mode 100644
index 0000000..ca9658e
--- /dev/null
+++ b/Questions.txt
@@ -0,0 +1,11 @@
+Machine Learning ToolBox Answers
+
+What is the general trend in the curve?
+The general trend is that the computer scores a higher accuracy on the test when it is given more of the set to train on. This can be seen in the graphs in the repository: the x axis is the portion of the set given for training, and the y axis is how well the computer did on the test portion of the set. The curve is increasing, so as the training set grows, so does the computer's ability to score well on the test portion. This makes sense, because the more the computer has to practice on and learn from, the better it can do when it is tested.
+Are there parts of the curve that appear to be noisier than others? Why?
+The first half of the curve appears to be noisier than the rest. This is probably because at the beginning the computer has seen too few training examples to learn the digits reliably, so its score depends heavily on which examples happened to land in the training set and varies more from trial to trial. As the computer gets more and more to practice on, thus “learning” more, its answers become more consistent, as can be seen in the latter half of the curve smoothing out. The noise at the beginning is probably the result of a lot of guessing and estimation on the computer's part, while towards the end the computer is actually computing the answers, so the curve gets smoother as you move to the right.
+How many trials do you need to get a smooth curve?
+10 trials obviously isn't enough; there is way too much noise. I then tried 100 trials, which came out better but still had noise. 1000 trials was almost smooth, but had some bumps in the middle. 5000 trials (the longest I was willing to wait for this script to run) was basically smooth. There is a screenshot of the curve for 5000 trials (5000_Trials.png). If you look very closely, there is probably still a little bit of noise in the 5000-trial curve as well. My best guess is that 5000-10000 trials are needed for a curve smooth enough for machine learning purposes, and an infinite number of trials for a perfectly smooth curve.
+Try different values for C (by changing LogisticRegression(C=10**-10)). What happens?
+When C becomes 1**-10 (which is just 1), the graph shifts up by a lot: the computer's accuracy rises to almost perfect. When C becomes 100**-10 (which is 1e-20), the graph goes up for a bit and then tanks down to 10%, which is chance level for ten digits, and stays there for the rest of the run, meaning the computer could no longer solve the problem once the training set got too large. When C becomes 9**-10 (about 2.9e-10), the line remains roughly the same as with 10**-10. When C becomes 3**-10 (about 1.7e-5), the graph moves back toward the C = 1 curve, but with lower accuracy. It seems that as C gets bigger, the accuracy gets higher and more consistent and the curve becomes more concave down, and as C gets smaller, the accuracy, consistency, and concavity all decrease.
+
diff --git a/learning_curve.py b/learning_curve.py
index 2364f2c..8ef5678 100644
--- a/learning_curve.py
+++ b/learning_curve.py
@@ -1,5 +1,11 @@
 """ Exploring learning curves for classification of handwritten digits """
 
+"""
+Completed by Kevin Zhang
+
+Software Design Spring 2016
+"""
+
 import matplotlib.pyplot as plt
 import numpy
 from sklearn.datasets import *
@@ -8,7 +14,7 @@
 
 data = load_digits()
 print data.DESCR
-num_trials = 10
+num_trials = 100
 train_percentages = range(5,95,5)
 test_accuracies = numpy.zeros(len(train_percentages))
 
@@ -17,7 +23,26 @@
 # You should repeat each training percentage num_trials times to smooth out variability
 # for consistency with the previous example use model = LogisticRegression(C=10**-10) for your learner
 
-# TODO: your code here
+
+
+for training_index in range(len(train_percentages)):
+
+    data = load_digits()
+    stabilizing_value = 0  # running sum of test accuracies, averaged below to smooth out variability
+
+    for i in range(num_trials):  # repeat the test num_trials times for each train_size value for more stability and smoothness
+        X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=train_percentages[training_index]/100.0)
+        model = LogisticRegression(C=10**-10)
+        model.fit(X_train, y_train)
+        print "Training with {}%".format(train_percentages[training_index])
+        print "Train accuracy %f" % model.score(X_train, y_train)
+        print "Test accuracy %f" % model.score(X_test, y_test)
+        print ''
+        stabilizing_value += model.score(X_test, y_test)
+
+    stabilizing_value /= num_trials  # take the average of all the accumulated values
+    test_accuracies[training_index] = stabilizing_value
+
 
 fig = plt.figure()
 plt.plot(train_percentages, test_accuracies)
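
Review note: the loop in learning_curve.py averages the test accuracy over num_trials random splits, and Questions.txt reports that the curve gets smoother as num_trials grows. The sketch below illustrates why, without touching scikit-learn. It is not the author's code: `averaged_accuracy`, `spread_of_average`, the "true" accuracy of 0.8, and the Gaussian noise of 0.05 are all hypothetical stand-ins for one train/test split. Under that assumption of independent trials, the spread of the averaged value shrinks roughly as 1/sqrt(num_trials).

```python
import random
import statistics


def averaged_accuracy(run_trial, num_trials):
    """Average the value returned by run_trial() over num_trials calls."""
    return sum(run_trial() for _ in range(num_trials)) / float(num_trials)


def spread_of_average(num_trials, repeats=200, seed=0):
    """Standard deviation of the averaged accuracy across `repeats` reruns.

    Each "trial" is simulated (an assumption, not real model output):
    a fixed accuracy of 0.8 plus Gaussian noise with sigma 0.05.
    """
    rng = random.Random(seed)

    def noisy_trial():
        return 0.8 + rng.gauss(0.0, 0.05)

    means = [averaged_accuracy(noisy_trial, num_trials) for _ in range(repeats)]
    return statistics.pstdev(means)


spread_10 = spread_of_average(10)
spread_1000 = spread_of_average(1000)

if __name__ == "__main__":
    print("spread with 10 trials:   %f" % spread_10)
    print("spread with 1000 trials: %f" % spread_1000)
```

Going from 10 to 1000 trials is a 100x increase, so the remaining noise should drop by roughly 10x. That also suggests diminishing returns: under the same assumption, doubling 5000 trials to 10000 would only reduce the noise by a factor of about 1.4.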