The Archer series focuses on research into RL algorithms and training for small- and medium-scale models, aiming to deepen the community's understanding of the fundamental principles of reinforcement learning (RL) on large language models (LLMs). All released content is fully open-sourced to advance community research.
Archer significantly improves reasoning performance over DAPO and outperforms previous 1.5B-scale SOTA reasoning models.
Archer is an open-source initiative enhancing reasoning in large language models through scalable, rule-governed reinforcement learning. We provide full-stack reproducibility including:
- Training code and pipelines
- Curated datasets
- Trained models
- Complete training logs
Current Models:
- Archer-Code-1.5B - SOTA among similarly-sized models.
We evaluate on both mathematical and coding benchmarks. Because reasoning models produce high-variance outputs, we report avg@K (pass@1 performance averaged over K outputs) and pass@K for each benchmark. Detailed results are shown in the table below.
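For reference, avg@K and pass@K can be computed from per-problem sample counts as sketched below. This uses the standard unbiased pass@k estimator; the sample counts in the example are illustrative, not results from our evaluation:

```python
from math import comb

def avg_at_k(num_samples: int, num_correct: int) -> float:
    """avg@K: pass@1 averaged over all K sampled outputs, i.e. the fraction correct."""
    return num_correct / num_samples

def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if num_samples - num_correct < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)

# Illustrative example: 16 samples drawn for one problem, 4 of them correct.
print(round(avg_at_k(16, 4), 3))      # 0.25
print(round(pass_at_k(16, 4, 8), 3))  # 0.962
```

Benchmark-level numbers are then the mean of these per-problem values.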
```bash
# Install a Python 3.10 environment.
conda create -n archer python=3.10 -y
conda activate archer

# Install dependencies.
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install --no-cache-dir flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

cd ArcherCodeR
pip install -e .
```

Download the training and test data from Hugging Face:
```bash
python tools/download_datasets.py
```

We have provided a one-click script to initialize Ray environments on any number of machines. Run the following command on the head node:
```bash
bash ./tools/start_ray.sh
```

Note:
- Please replace `your_wandb_api_key` in `export WANDB_API_KEY=your_wandb_api_key` with your actual key.
- Hostfile locations vary across operating systems (e.g., on our machine it is located at `/etc/mpi/hostfile`). Locate the file on your server and update its contents accordingly.
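The exact hostfile format depends on your MPI distribution; an OpenMPI-style hostfile typically lists one node per line with its slot count, as in this illustrative example (hostnames and slot counts are placeholders, not values from our setup):

```
node-0 slots=8
node-1 slots=8
```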
We currently provide only the script and data needed to reproduce the results of Archer-Code-1.5B.
```bash
bash ./scripts/train/run_archer_qwen2.5_1.5b_code.sh
```

Run the following command to convert the model to Hugging Face format:
```bash
bash ./tools/model_merge.sh
```

Execute the script below to evaluate model performance on the LiveCodeBench v5 benchmark:
```bash
bash ./scripts/eval/run_eval.sh
```

Note: Please update the path parameters in the scripts above as needed.
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
- We build our model upon DeepSeek-R1-Distill-Qwen-1.5B.
- Training was carried out with a modified version of verl.
Please cite the following:
```bibtex
@article{wang2025stabilizing,
  title={Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR},
  author={Wang, Jiakang and Liu, Runze and Zhang, Fuzheng and Li, Xiu and Zhou, Guorui},
  journal={arXiv preprint arXiv:2507.15778},
  year={2025}
}
```
