
Conversation

@shayan74 commented Nov 7, 2025

Dear Jadi,

Thank you for creating such a wonderful machine learning course — I’ve been recommending it to Persian-speaking students who are eager to learn ML.

While reviewing the code, I noticed a small detail in the train/test split logic that might cause slight variations in the ratio. The current approach:

np.random.rand(len(df)) < 0.8

works well in general, but due to randomness it may yield training ratios anywhere between roughly 77% and 82%. This is perfectly acceptable for large datasets, but on smaller datasets it can lead to noticeable deviations and potential confusion for learners.
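For illustration, here is a quick simulation of that drift (a sketch, assuming a seeded NumPy generator and a hypothetical 50-row dataset, where the effect is most visible):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # hypothetical small dataset size

# Repeat the per-row Bernoulli split many times and record the training ratio
ratios = [(rng.random(n) < 0.8).mean() for _ in range(1000)]
print(f"training ratio ranged from {min(ratios):.0%} to {max(ratios):.0%}")
```

Each row is an independent 80% coin flip, so the realized ratio is binomial with a standard deviation of about sqrt(0.8 * 0.2 / n), which shrinks only as the dataset grows.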

To make the ratio more consistent, I suggest using:

def random_boolean_array(x, true_ratio=0.8):
    n_true = int(x * true_ratio)   # exact number of training rows
    n_false = x - n_true           # the rest go to the test set
    arr = np.array([True] * n_true + [False] * n_false)
    np.random.shuffle(arr)         # randomize which rows get which label
    return arr

This approach produces an exact 80/20 split (up to integer rounding); only the positions of the True values vary between runs.
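As a self-contained check (repeating the function above so the snippet runs on its own), the fixed-count shuffle always marks exactly int(n * 0.8) rows for training:

```python
import numpy as np

def random_boolean_array(x, true_ratio=0.8):
    n_true = int(x * true_ratio)   # exact number of training rows
    n_false = x - n_true           # the rest go to the test set
    arr = np.array([True] * n_true + [False] * n_false)
    np.random.shuffle(arr)         # randomize which rows get which label
    return arr

mask = random_boolean_array(50)
print(mask.sum())  # always 40: the 80/20 ratio is exact, only positions vary
```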

Or, to keep the inline coding style (and avoid a manual splitting function), we can use:

np.random.choice([True, False], size=len(df), p=[0.8, 0.2])
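A minimal usage sketch (with a hypothetical 10-row df standing in for the course's data). Worth noting: np.random.choice here still draws each row independently, so the exact counts can drift slightly between runs, just like the original one-liner; the gain is readability, since the probabilities are spelled out explicitly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # hypothetical stand-in for the course's df

# Boolean mask: each row independently has an 80% chance of landing in train
mask = np.random.choice([True, False], size=len(df), p=[0.8, 0.2])
train, test = df[mask], df[~mask]
print(len(train), len(test))  # always sums to 10; the exact split varies
```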

Thank you for your time and for the excellent educational content you share.

Damet Garm! (Persian for "well done!")
Shayan

@jadijadi (Owner) commented

Thanks for the contribution. Your logic is valid and raises an important point. But this is a basic educational lesson, and a one-line choice is good enough. I think changing it to something that needs a lot of explanation would frighten the students.

But it would be great if you could add this point as a comment below the actual code. It's fine to have a multi-line comment describing the issue with my simple one-liner and proposing the fix, as long as it's all commented out.
