learning_curve.py (18 changes: 15 additions & 3 deletions)
@@ -6,19 +6,31 @@
 from sklearn.cross_validation import train_test_split
 from sklearn.linear_model import LogisticRegression


 data = load_digits()
 print data.DESCR
-num_trials = 10
+
+num_trials = 100
 train_percentages = range(5,95,5)
-test_accuracies = numpy.zeros(len(train_percentages))
+test_accuracies = []
+for i in train_percentages:
+    avg_test_accuracy = 0
+    for j in range(0, num_trials):
+        X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=i / 100.0)
+        model = LogisticRegression(C=10**-3)
+        model.fit(X_train, y_train)
+        avg_test_accuracy += model.score(X_test, y_test)
+    avg_test_accuracy /= num_trials
+    print i
+    print "Test accuracy %f"%avg_test_accuracy
+    test_accuracies.append(avg_test_accuracy)
+
 # train a model with training percentages between 5 and 90 (see train_percentages) and evaluate
 # the resultant accuracy.
 # You should repeat each training percentage num_trials times to smooth out variability
 # for consistency with the previous example use model = LogisticRegression(C=10**-10) for your learner

-# TODO: your code here

 fig = plt.figure()
 plt.plot(train_percentages, test_accuracies)
 plt.xlabel('Percentage of Data Used for Training')
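The added code targets an older scikit-learn: sklearn.cross_validation was removed in scikit-learn 0.20 (its contents moved to sklearn.model_selection), and the print statements are Python 2. A minimal Python 3 port of the final script might look like the sketch below; the ylabel and show calls at the end are an assumption, since the lines below the visible hunk are not shown.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation

data = load_digits()
print(data.DESCR)

num_trials = 100
train_percentages = range(5, 95, 5)
test_accuracies = []

for pct in train_percentages:
    avg_test_accuracy = 0.0
    for _ in range(num_trials):
        # Draw a fresh random split each trial; averaging over trials
        # smooths out split-to-split variability.
        X_train, X_test, y_train, y_test = train_test_split(
            data.data, data.target, train_size=pct / 100.0)
        model = LogisticRegression(C=10**-3)
        model.fit(X_train, y_train)
        avg_test_accuracy += model.score(X_test, y_test)
    avg_test_accuracy /= num_trials
    print("%d%% train: average test accuracy %f" % (pct, avg_test_accuracy))
    test_accuracies.append(avg_test_accuracy)

fig = plt.figure()
plt.plot(train_percentages, test_accuracies)
plt.xlabel('Percentage of Data Used for Training')
plt.ylabel('Accuracy on Test Set')  # assumed continuation of the hidden lines
plt.show()                          # assumed continuation of the hidden lines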
questions.txt (4 changes: 4 additions & 0 deletions)
@@ -0,0 +1,4 @@
+1. The general trend in the curve is upward, with diminishing returns at higher percentages of data used for training.
+2. Lower percentages of data used for training tend to result in more noise. This makes sense: with less training data, each random split gives the model less to learn from, so its accuracy fluctuates more from trial to trial.
+3. 1000 trials produces a decently smooth curve.
+4. As C increases, the accuracy values all increase, since C is the inverse of regularization strength and a larger C means a weaker regularization penalty. The graph also acquires a more curved shape, suggesting that the rate of diminishing returns increases with C. In other words, the first few percentage points of additional training data matter less for low values of C than they do for high values of C.
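Answers 3 and 4 are straightforward to check empirically. Below is a hedged sketch that overlays learning curves for several values of C (same Python 3 / sklearn.model_selection assumptions as the port above; the specific C values and the reduced trial count are choices made here so the sweep runs quickly, not part of the original assignment):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
num_trials = 10  # fewer trials than above, so expect somewhat noisier curves
train_percentages = range(5, 95, 5)

# Sweep a few regularization settings; C values chosen for illustration only.
for C in (10**-10, 10**-3, 1.0):
    accuracies = []
    for pct in train_percentages:
        total = 0.0
        for _ in range(num_trials):
            X_train, X_test, y_train, y_test = train_test_split(
                data.data, data.target, train_size=pct / 100.0)
            total += LogisticRegression(C=C).fit(X_train, y_train).score(X_test, y_test)
        accuracies.append(total / num_trials)
    plt.plot(train_percentages, accuracies, label='C = %g' % C)

plt.xlabel('Percentage of Data Used for Training')
plt.ylabel('Accuracy on Test Set')
plt.legend()
plt.show()

Larger C (weaker regularization) should shift the curve up and make the early gains steeper, matching the description in answer 4, while raising num_trials back toward 1000 should smooth the curves, matching answer 3.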