test script for deepseek #2902
base: main
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Force-pushed the branch from dbed671 to 4f5a1c4.
🤖 Hi @shuningjin, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This Pull Request introduces new test scripts for DeepSeek models (v2-16b and v3-671b) and refactors existing test scripts for GPT-OSS, Llama4, and Mixtral. The changes aim to onboard DeepSeek models, improve consistency in script practices, and adjust logging verbosity.
🔍 General Feedback
- The refactoring of test scripts into modular `1_test_deepseek.sh` and `2_test_deepseek.sh` for DeepSeek v3 is a positive step for organization and clarity.
- Consistent application of `absl.logging.set_verbosity(absl.logging.INFO)` across relevant Python files is good for debugging and monitoring.
- The general updates to how `BASE_OUTPUT_PATH` is handled in the shell scripts improve robustness.
- Consider reviewing the commented-out training and fine-tuning steps in the DeepSeek test scripts to ensure they are intentionally disabled and to add explanatory comments if needed.
- Ensure all golden logit paths are robust and not dependent on transient local paths or specific dates (see the sketch after this list).
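To make the last two points concrete, here is a minimal sketch of defensive path handling in one of these shell scripts; the bucket names and golden-logits layout below are assumptions for illustration, not values from this PR:

```sh
# Sketch only: bucket names and the golden-logits layout are hypothetical.
# Default BASE_OUTPUT_PATH when the caller (e.g. XLML) does not provide one.
if [ -z "${BASE_OUTPUT_PATH}" ]; then
  BASE_OUTPUT_PATH="gs://my-maxtext-logs/deepseek/$(date +%Y-%m-%d-%H-%M)"
fi

# Pin golden logits to a fixed, versioned location instead of a per-run
# temp directory or a date-stamped path that silently goes stale.
GOLDEN_LOGITS_DISK_LOCATION=${GOLDEN_LOGITS_DISK_LOCATION:-"gs://my-maxtext-test-assets/golden-logits/deepseek2-16b/golden_logits.jsonl"}

echo "BASE_OUTPUT_PATH=${BASE_OUTPUT_PATH}"
echo "GOLDEN_LOGITS_DISK_LOCATION=${GOLDEN_LOGITS_DISK_LOCATION}"
```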
RissyRan left a comment:
Great work! Also thanks for keeping them up-to-date! A few minor comments.
# Run fine-tuning - megablox implementation
python3 -m MaxText.train "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}/"configs/base.yml base_output_directory=${BASE_OUTPUT_PATH} run_name=megablox_fine_tuning model_name=${MODEL_NAME} tokenizer_type=huggingface tokenizer_path=${TOKENIZER_PATH} dataset_path=${DATASET_PATH} enable_checkpointing=true async_checkpointing=false load_parameters_path=${SCANNED_CKPT_PATH} scan_layers=True attention=flash sparse_matmul=True megablox=True dtype=bfloat16 weight_dtype=bfloat16 per_device_batch_size=4 steps=5 max_target_length=1024 ici_fsdp_parallelism=1 ici_expert_parallelism=4 checkpoint_storage_concurrent_gb=1024

# Run supervised fine-tuning - megablox implementation
Let's remove it if it's commented out?
Removed this unused command for SFT.
# Run pre-training - megablox implementation
python3 -m MaxText.train "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}/"configs/base.yml base_output_directory=${BASE_OUTPUT_PATH} run_name=megablox_pre_training model_name=${MODEL_NAME} tokenizer_type=huggingface tokenizer_path=${TOKENIZER_PATH} dataset_type=synthetic enable_checkpointing=false attention=flash sparse_matmul=True megablox=True dtype=bfloat16 weight_dtype=bfloat16 per_device_batch_size=4 steps=5 max_target_length=1024 ici_fsdp_parallelism=4

# Run fine-tuning - megablox implementation
Let's update megablox to tokamax gmm. We may deprecate megablox sometime later, and most workloads are using tokamax gmm now.
Thanks for clarifying the MoE strategies! I updated to tokamax_gmm for training and decoding for DeepSeek.
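For reference, the kind of flag change being discussed might look like the sketch below. The exact tokamax option name (shown here as `use_tokamax_gmm`) is an assumption and should be checked against MaxText's base.yml; the rest of the flags mirror the pre-training command quoted above.

```sh
# Sketch of the intended MoE kernel switch. The tokamax flag spelling is assumed;
# confirm it against MaxText's base.yml / MoE config before using.
#   before: ... sparse_matmul=True megablox=True ...
#   after:  ... sparse_matmul=True megablox=False use_tokamax_gmm=True ...
python3 -m MaxText.train "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}/"configs/base.yml \
  base_output_directory=${BASE_OUTPUT_PATH} run_name=tokamax_pre_training model_name=${MODEL_NAME} \
  tokenizer_type=huggingface tokenizer_path=${TOKENIZER_PATH} dataset_type=synthetic enable_checkpointing=false \
  attention=flash sparse_matmul=True megablox=False use_tokamax_gmm=True \
  dtype=bfloat16 weight_dtype=bfloat16 per_device_batch_size=4 steps=5 max_target_length=1024 ici_fsdp_parallelism=4
```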
python3 -m tests.forward_pass_logit_checker ${MAXTEXT_PKG_DIR}/configs/base.yml base_output_directory=${BASE_OUTPUT_PATH} run_name=forward_logits_check load_parameters_path=${SCANNED_CKPT_PATH} scan_layers=true attention=dot_product per_device_batch_size=1 model_name=${MODEL_NAME} max_prefill_predict_length=4 max_target_length=4 async_checkpointing=false sparse_matmul=false ici_fsdp_parallelism=1 ici_expert_parallelism=-1 checkpoint_storage_concurrent_gb=1024 weight_dtype=float32 dtype=float32 activations_in_float32=true matmul_precision=highest float32_logits=true float32_qk_product=true --golden_logits_path=${GOLDEN_LOGITS_DISK_LOCATION} --atol=1.5 --rtol=1.5 --max_kl_div=0.1

# Run pre-training - megablox implementation
Similar comment as above.
I assume you mean remove (rather than update with tokamax)? Removed train/finetune/SFT.
idx=$(date +%Y-%m-%d)

# By default, we'll use "llama4-17b-16e"
if [ -z "${MODEL_VARIATION}" ]; then
Do we need to set this from XLML? Or could we directly export MODEL_VARIATION="llama4-17b-16e" without the if/else?
For other models, we have separate test scripts for each variant. For llama4, this serves as the unified script. Do we need to account for Maverick as well?
I see. I think we should do something similar to the GPT-OSS models: pass the variant type instead of duplicating the scripts. No need to do it in this PR.
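A possible shape for that follow-up (not part of this PR) is sketched below: pass the variant into a single script, GPT-OSS style. The Maverick variant name shown is an assumption and should be checked against the MaxText model configs.

```sh
# Sketch: accept the llama4 variant as a positional argument instead of
# duplicating the script per variant. Names other than llama4-17b-16e are
# assumptions for illustration.
MODEL_VARIATION=${1:-llama4-17b-16e}   # default: Scout; e.g. llama4-17b-128e for Maverick (assumed name)
case "${MODEL_VARIATION}" in
  llama4-17b-16e|llama4-17b-128e) ;;   # known/assumed variants
  *) echo "Unknown MODEL_VARIATION: ${MODEL_VARIATION}" >&2; exit 1 ;;
esac
export MODEL_VARIATION
echo "Running llama4 tests for ${MODEL_VARIATION}"
```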
# Test whether the forward pass logits match the golden logits - matmul implementation
python3 -m tests.forward_pass_logit_checker "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}/"configs/base.yml base_output_directory=${BASE_OUTPUT_PATH} load_parameters_path=${UNSCANNED_CKPT_PATH} run_name=matmul_forward_pass_test per_device_batch_size=4 model_name=mixtral-8x7b tokenizer_path="${MAXTEXT_ASSETS_ROOT:-${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText/assets}}"/tokenizer.mistral-v1 ici_tensor_parallelism=1 ici_fsdp_parallelism=-1 max_prefill_predict_length=11 max_target_length=11 dataset_type=synthetic dtype=float32 megablox=False sparse_matmul=False scan_layers=false --atol=3 --rtol=1 --token_size=4
python3 -m tests.forward_pass_logit_checker "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}/"configs/base.yml base_output_directory=${BASE_OUTPUT_PATH} load_parameters_path=${UNSCANNED_CKPT_PATH} run_name=matmul_forward_pass_test per_device_batch_size=1 model_name=mixtral-8x7b tokenizer_path="${MAXTEXT_ASSETS_ROOT:-${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText/assets}}"/tokenizer.mistral-v1 ici_tensor_parallelism=1 ici_fsdp_parallelism=-1 max_prefill_predict_length=11 max_target_length=11 dataset_type=synthetic dtype=float32 megablox=False sparse_matmul=False scan_layers=false --token_size=4 --atol=3 --rtol=1 --max_kl_div=2e-3
Do you know if --atol=3 --rtol=1 is still needed?
Removed --atol=3 --rtol=1; instead using the criterion --max_kl_div=2e-3.
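For reference, the resulting invocation would look roughly like the second command in the diff above with the element-wise tolerances dropped; a sketch:

```sh
# Sketch: same matmul forward-pass check, keeping only the KL-divergence criterion.
python3 -m tests.forward_pass_logit_checker \
  "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}/"configs/base.yml \
  base_output_directory=${BASE_OUTPUT_PATH} load_parameters_path=${UNSCANNED_CKPT_PATH} \
  run_name=matmul_forward_pass_test per_device_batch_size=1 model_name=mixtral-8x7b \
  tokenizer_path="${MAXTEXT_ASSETS_ROOT:-${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText/assets}}"/tokenizer.mistral-v1 \
  ici_tensor_parallelism=1 ici_fsdp_parallelism=-1 \
  max_prefill_predict_length=11 max_target_length=11 \
  dataset_type=synthetic dtype=float32 megablox=False sparse_matmul=False scan_layers=false \
  --token_size=4 --max_kl_div=2e-3
```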
RissyRan left a comment:
Thanks for the updates!
Description

Main change
- Onboard test for deepseek2-16b; plan to run the full workflow in XLML on v5p-8.
  - `end_to_end/tpu/deepseek/v2-16b/test_deepseek.sh`: ckpt conversion, logit check, train, finetune, decode
- Onboard test for deepseek3-671b; plan to run step 2 in XLML on v5p-128.
  - `end_to_end/tpu/deepseek/v3-671b/1_test_deepseek.sh`
  - `end_to_end/tpu/deepseek/v3-671b/2_test_deepseek.sh`: logit check (scan), decode (unscan). Note we leave out training and fine-tuning, as these are covered by the ubench nightly test.
- XLML PR: GoogleCloudPlatform/ml-auto-solutions#1112

FIXES: b/423057893

Other change
- `max_logging.log` behavior changed due to PR #2873; we need to set the absl.logging level to make logs visible for multiple scripts.

Tests
- Preparation (see the usage sketch at the end of this description)
- `bash end_to_end/tpu/deepseek/v3-671b/2_test_deepseek.sh`
- `bash end_to_end/tpu/deepseek/v2-16b/test_deepseek.sh`

Checklist
Before submitting this PR, please make sure (put X in square brackets):
- gemini-review label.
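For local reproduction, a minimal usage sketch; the environment values below are placeholders, not values from this PR:

```sh
# Sketch only: bucket and token values are placeholders.
export BASE_OUTPUT_PATH="gs://my-maxtext-logs/deepseek-test"
export HF_TOKEN="<your-huggingface-token>"   # assumed to be needed for the HF tokenizer

# DeepSeek v2-16b: full workflow (ckpt conversion, logit check, train, finetune, decode).
bash end_to_end/tpu/deepseek/v2-16b/test_deepseek.sh

# DeepSeek v3-671b: step 2 (logit check on the scanned ckpt, decode on the unscanned ckpt);
# 1_test_deepseek.sh is presumably the preparation step run beforehand.
bash end_to_end/tpu/deepseek/v3-671b/2_test_deepseek.sh
```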