27 changes: 24 additions & 3 deletions learning_curve.py
@@ -7,8 +7,8 @@
 from sklearn.linear_model import LogisticRegression
 
 data = load_digits()
-print data.DESCR
-num_trials = 10
+# print data.DESCR
+num_trials = 20
 train_percentages = range(5,95,5)
 test_accuracies = numpy.zeros(len(train_percentages))
 
@@ -17,10 +17,31 @@
 # You should repeat each training percentage num_trials times to smooth out variability
 # for consistency with the previous example use model = LogisticRegression(C=10**-10) for your learner
 
-# TODO: your code here
 
+def train_test(percent):
+    '''Partition the data into training and testing sets, fit a logistic
+    regression on the training set, and return the testing accuracy.
+    percent: percent of the data reserved for training'''
+    X_train, X_test, y_train, y_test = train_test_split(
+        data.data, data.target, train_size=percent / 100.0)
+    model = LogisticRegression(C=10**0)
+    model.fit(X_train, y_train)
+    return model.score(X_test, y_test)
+
+for i, percentage in enumerate(train_percentages):
+    t = 0
+    for j in range(num_trials):
+        t += train_test(percentage) / num_trials  # average accuracy over the trials
+    test_accuracies[i] = t
+
+
 fig = plt.figure()
+# for i in range(10):
+#     subplot = fig.add_subplot(5, 2, i + 1)
+#     subplot.matshow(numpy.reshape(data.data[i],
+#                                   (8, 8)), cmap='gray')
 plt.plot(train_percentages, test_accuracies)
 plt.xlabel('Percentage of Data Used for Training')
 plt.ylabel('Accuracy on Test Set')
+plt.title(str(num_trials))
 plt.show()
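
For comparison (not part of this PR): scikit-learn also ships a helper that computes the same kind of curve using cross-validation instead of repeated random splits. A minimal Python 3 sketch, assuming a recent scikit-learn where train_test_split and learning_curve live in sklearn.model_selection:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

digits = load_digits()
# Mirror range(5,95,5) above: training fractions from 5% to 90%.
fractions = np.arange(0.05, 0.95, 0.05)
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(C=10**0, max_iter=1000),
    digits.data, digits.target,
    train_sizes=fractions, cv=5)

plt.plot(sizes, test_scores.mean(axis=1))  # mean accuracy across the 5 folds
plt.xlabel('Number of Training Examples')
plt.ylabel('Accuracy on Test Set')
plt.show()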
7 changes: 7 additions & 0 deletions questions.txt
@@ -0,0 +1,7 @@
+1. The accuracy increases as the percentage of the data partitioned for training increases.
+
+2. The ends of the curve appear slightly less noisy than the middle. With a very large training set, accuracy is reliably high, and the small test set leaves little room for trial-to-trial variation; with a very small training set the reverse holds, and accuracy is reliably low no matter how the large test set is drawn. In the middle, both sets are moderately sized, so the accuracy has more room to vary from trial to trial (the first sketch after these answers shows one way to measure that spread).
+
+3. The curve seems to be much smoother after about 25 trials.
+
+4. The trend of the curve becomes increasingly logarithmic as C becomes larger (see the sweep sketch below).
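
One way to check the noise claim in answer 2 is to record the spread across trials rather than only the average. A sketch reusing train_test, train_percentages, and num_trials from learning_curve.py above (illustrative only):

spreads = numpy.zeros(len(train_percentages))
for i, percentage in enumerate(train_percentages):
    scores = [train_test(percentage) for _ in range(num_trials)]
    spreads[i] = numpy.std(scores)  # larger std = noisier point on the curve
# Per answer 2, the spread should be smaller near both ends than in the middle.

Answer 4's observation can be reproduced by sweeping the regularization parameter. The helper below is hypothetical (a variant of train_test with C exposed), and data, plt, train_percentages, and num_trials again come from learning_curve.py:

def train_test_with_C(percent, C):
    # Same as train_test above, but with the regularization strength exposed.
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, train_size=percent / 100.0)
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

for C in [10**-10, 10**-5, 10**0, 10**5]:
    curve = [sum(train_test_with_C(p, C) for _ in range(num_trials)) / num_trials
             for p in train_percentages]
    plt.plot(train_percentages, curve, label='C = %g' % C)
plt.legend()
plt.show()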