This repository follows a standard setup I use for data science projects, which includes:
- A research compendium layout, including a local Python package (see File Structure).
- Visual Studio Code (VSC) as the preferred IDE, with recommended extensions.
- A VS Code Dev Container, powered by Docker, as a reproducible development environment (using a Debian image).
- pre-commit to manage git hooks.
- Python tooling:
  - Black for code formatting (pre-commit and VSC extension). In addition, I mostly follow the Google style guide.
  - Ruff (pre-commit and VSC extension) for linting.
  - mypy for type checking (VSC extension).
  - uv to compile requirements.
  - pdoc to generate API documentation (including a pre-commit hook for generating local documentation). Python docstrings are written following the Google docstring format, with the help of the autoDocstring VSC extension.
  - pytest for testing, with doctest enabled (see the docstring example after this list).
  - Automatic versioning of the local package from git tags via setuptools_scm, following semantic versioning.
- SQLFluff as a formatter and linter for SQL files (pre-commit and VSC extension).
- prettier (VSC extension) as a formatter for YAML, JSON and Markdown files.
- markdownlint (VSC extension) as a linter for Markdown files.
- Taplo (VSC extension) as a formatter for TOML files.
- shfmt (VSC extension) as a formatter for shell scripts.
- SonarLint (VSC extension) as an additional multi-language linter.
- typos (VSC extension) as a code spell checker.
- A Makefile to provide an interface to common tasks (see Make commands).
- Conventional commit messages (enforced by pre-commit).
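As an illustration of the docstring and testing conventions above, here is a minimal, made-up function with a Google-style docstring whose example doubles as a doctest; pytest collects such examples when doctest support is enabled (e.g. via `--doctest-modules`):

```python
def scale(values: list[float], factor: float = 2.0) -> list[float]:
    """Multiply each value by a constant factor.

    Args:
        values: Numbers to scale.
        factor: Multiplicative factor applied to each value.

    Returns:
        A new list with the scaled values.

    Examples:
        >>> scale([1.0, 2.0], factor=3.0)
        [3.0, 6.0]
    """
    return [v * factor for v in values]
```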
```
.
├── analysis/                 # Analysis scripts and notebooks
├── data/                     # Data files (usually git ignored)
├── docs/                     # API documentation (git ignored)
├── results/                  # Output files: figures, tables, etc. (git ignored)
├── scripts/                  # Utility scripts (e.g. env setup)
├── src/                      # Local Python package
│   ├── __init__.py
│   └── config.py             # Configs, constants, settings
├── tests/                    # Unit tests for src/
│   └── test_*.py
├── .devcontainer/            # VS Code Dev Container setup
├── .vscode/                  # VS Code settings and extensions
├── Dockerfile                # Dockerfile used for dev container
├── Makefile                  # Utility commands (docs, env, versioning)
├── pyproject.toml            # Configs for package, tools (Ruff, mypy, etc.) and direct deps
├── requirements.txt          # Pinned dependencies (generated)
├── taplo.toml                # Configs for TOML formatter
├── .editorconfig             # Configs for shell formatter
├── .pre-commit-config.yaml   # Configs for pre-commit
└── .sqlfluff                 # Configs for SQLFluff
```
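For illustration, `src/config.py` centralises the paths, constants and settings shared across the project. A minimal sketch, assuming the conventional directories from the tree above (the exact constants will vary per project), could look like:

```python
"""Project configuration: paths, constants and settings."""

from pathlib import Path

# Repository root, resolved relative to this file (src/config.py).
ROOT_DIR = Path(__file__).resolve().parents[1]

# Conventional locations from the file structure above.
DATA_DIR = ROOT_DIR / "data"
RES_DIR = ROOT_DIR / "results"

# Illustrative project-wide setting.
RANDOM_SEED = 42
```

Constants such as `RES_DIR` can then be reused elsewhere, for example in the MLflow snippet further below.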
The preferred development environment for this project is a VS Code Dev Container, which provides a consistent and reproducible setup using Docker.
- Install and launch Docker.
- Install VS Code and open the project in it.
- Open the container via the command palette (`Ctrl + Shift + P`) by searching for "Dev Containers: Open Folder in Container...".
The dependencies specified in requirements.txt are automatically installed in the container and the local package is available in editable mode.
If needed, the container can be rebuilt by searching for "Dev Containers: Rebuild Container...".
For more details regarding Dev Containers, or alternative environment setups (venv, Conda, etc.), please refer to DEVELOPMENT.md.
Regardless of the environment, install the Git hooks after setup with `pre-commit install` to ensure the code is automatically linted and formatted on commit.
Requirements are managed with:
- `pyproject.toml` to list direct dependencies of the `src` package and development dependencies (e.g. for the analysis).
- `requirements.txt` to pin all dependencies (direct and indirect). This file is automatically generated with uv and is used to fully recreate the environment.
The local package (`src`) is not included in `requirements.txt`, so installation is a two-step process.
- Initial setup or adding new direct dependencies:
  - Add dependencies to `pyproject.toml`.
  - Run `make reqs` to compile `requirements.txt`.
- Upgrading packages: compile new requirements with `uv pip compile pyproject.toml -o requirements.txt --all-extras --upgrade`, then run `make deps`.

Finally, run `make deps` to install the pinned dependencies and the local package in editable mode.
Common utility commands are available via the Makefile, including:
- `make reqs`: Compile `requirements.txt` from `pyproject.toml`.
- `make deps`: Install requirements and the local package.
- `make docs`: Generate the package documentation.
- `make tag`: Create and push a new Git tag by incrementing the version.
- `make venv`: Set up a venv environment (see `DEVELOPMENT.md`).
The full list of targets can be displayed with `make help`.
Delete this section after initialising a project from the template.
This template aims to be relatively lightweight and tailored to my needs. It is therefore opinionated and in constant evolution, reflecting my data science journey with Python. It is also worth noting that this template focuses more on experimentation than on sharing a single final product.
- Initialise your GitHub repository with this template. Alternatively, fork (or copy the content of) this repository.
- Update:
  - the project metadata in `pyproject.toml`, such as the description and the authors.
  - the repository name (if the template was forked).
  - the README (title, badges, sections).
  - the license.
- Set up your preferred development environment (see Development Environment).
- Specify, compile and install your requirements (see Managing requirements).
- Adjust the configurations to your needs (e.g. the Python configuration in `src/config.py`, the SQL dialect in `.sqlfluff`, etc.).
- Add a git tag for the initial version with `git tag -a v0.1.0 -m "Initial setup"`, and push it with `git push origin --tags`. Alternatively, use `make tag`.
- (Optional) Update pre-commit with `pre-commit autoupdate`.
The `src/` package could contain the following modules or sub-packages, depending on the project:
- `utils` for utility functions.
- `data_processing`, `data` or `datasets` for data processing functions.
- `features` for extracting features.
- `models` for defining models.
- `evaluation` for evaluating performance.
- `plots` for plotting functions.
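Because the package is installed in editable mode, analysis scripts and notebooks can import these modules directly; a hypothetical sketch (the helper and constant names are placeholders, not part of the template):

```python
# Hypothetical imports from the local package; adapt to the modules you create.
from src import config
from src.utils import set_seed  # placeholder helper

set_seed(config.RANDOM_SEED)  # placeholder constant
```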
The repository structure could be extended with:
- `models/` to store model files.
- subfolders in `data/`, such as `data/raw/` for storing raw data.
MLflow can be used as a tool to track Machine Learning experiments.
Often, MLflow will be configured so that results are saved to a remote database and artifact store.
If this is not the case, the following can be added in `src/config.py` to set up local storage for MLflow experiments:

```python
import os

# Assumes RES_DIR (the results/ directory) is defined earlier in config.py.
MLRUNS_DIR = RES_DIR / "mlruns"
TRACKING_URI = "file:///" + MLRUNS_DIR.as_posix()
os.environ["MLFLOW_TRACKING_URI"] = TRACKING_URI
```

Then, the MLflow UI can be launched with:

```sh
mlflow ui --backend-store-uri file:///path/to/results/mlruns
```

For a slightly more elaborate setup running an MLflow server with a local database and artifact store as part of a Dev Container, see ghurault/mlflow-devcontainer.
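To sketch how experiments might then be tracked (the experiment, parameter and metric names below are made up, and `src.config` is imported first so that the tracking URI above is set):

```python
import mlflow

from src import config  # noqa: F401  (sets MLFLOW_TRACKING_URI as a side effect)

# Hypothetical experiment, parameter and metric names, for illustration only.
mlflow.set_experiment("baseline-model")
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("rmse", 0.42)
```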
Configurations, such as credentials, can be loaded from a .env file.
This can be achieved by mounting a .env file directly in the Dev Container, updating the runArgs option in .devcontainer/devcontainer.json accordingly.
Alternatively, one can use the python-dotenv package and add the following in src/config.py:
```python
from dotenv import load_dotenv

load_dotenv()
```

Full project documentation (beyond the API) could be generated using mkdocs or quartodoc.
This template is not tied to a specific platform and does not include continuous integration workflows. Nevertheless, the template could be extended with the following integrations:
- GitHub's Dependabot for dependency updates, or pip-audit.
- Testing and code coverage.
- Building and hosting documentation.
This template is inspired by the concept of a research compendium, by similar templates I created for R projects (e.g. reproducible-workflow), and by other, more exhaustive templates such as: