Jul 12, 2020
Machine learning best practices, guidelines, and recommended tools for developers. They apply to Data Analysts, Data Engineers, Machine Learning Engineers, Data Scientists, or any research team in general.
Photo by Daniel Chekalov / Unsplash
Git and GitHub
- If you're new to Git and GitHub/GitLab, watch this course from Udacity
- Ensure your Git client is configured with the correct email address and linked to your GitHub/GitLab user
- Use Git-based repositories, and push all code to the company's GitLab (or GitHub). Ask your manager for access to your respective groups so that you can create repositories and push your code
- Don't push your code directly to the master branch. Use branches and tags
- Always send a Pull Request (Merge Request) to your senior developer. If you're working alone, send it to yourself
- Install and use GitHub Desktop for better code management and visibility. Install GitHub CLI and Hub CLI if you're a CLI pro
- Read more here
.gitignore
- Be sure to ignore trivial files and installed dependencies
- Ignore large files such as images and caches, as well as private key files
- If you're not sure what to ignore, use gitignore.io to generate a .gitignore file
Commit Messages
You're not expected to follow everything mentioned in the links below, but rather to develop a habit of writing good commit messages
Secret Keys
- Never, ever commit any of the API Keys, Secret Keys, Tokens, URLs or Passwords in any of the files.
- Read more here and here
- Use .env files and read the keys from environment variables. The details depend on the language and tools you use, e.g. Python, Node, or Docker
- Exclude the .env file from commits by adding .env to the .gitignore. You can also commit an example configuration, .env.sample, with dummy data or blanks to show the schema your application requires
- If you commit a secret key by mistake, notify your senior developer or manager as soon as possible. Read more on the removal of sensitive data here
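As a rough illustration of the .env approach above, here is a minimal sketch in Python that uses only the standard library. The helper names `load_env_file` and `get_secret`, and the variable name `EXAMPLE_API_KEY` mentioned afterwards, are hypothetical and chosen just for this example; in practice a dedicated package such as python-dotenv is a common choice for the loading step.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Load KEY=VALUE pairs from a .env-style file into os.environ.

    Hypothetical helper for illustration; variables that are already
    set in the environment are not overwritten.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def get_secret(name):
    """Return a required secret from the environment, failing loudly if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

A script would call `load_env_file()` once at startup and then read each key with something like `get_secret("EXAMPLE_API_KEY")`, so the key itself never appears in the committed code.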
README.md
- Be sure to include a README.md file in every repository you create
- Find best practices here and try to incorporate whichever suits your work
Githooks
Git hooks are scripts that Git executes before or after events such as commit, push, and receive. Check out Githooks
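To make this concrete, here is a sketch of what a pre-commit hook could look like if written in Python and saved as an executable file at .git/hooks/pre-commit. It scans the staged changes for strings that look like secrets and reports them. The patterns and the function names (`find_secrets`, `staged_diff`, `main`) are assumptions made for this example, not a vetted scanner.

```python
#!/usr/bin/env python3
"""Sketch of a pre-commit hook that flags likely secrets in staged changes."""
import re
import subprocess
import sys

# Illustrative patterns only; a real project would tune these.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(api_key|secret|password|token)\s*=\s*['\"][^'\"]{8,}['\"]"),
]


def find_secrets(text):
    """Return the lines of `text` that match any secret pattern."""
    return [
        line for line in text.splitlines()
        if any(pattern.search(line) for pattern in SECRET_PATTERNS)
    ]


def staged_diff():
    """Return the staged changes as unified diff text."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main():
    """Return 1 if added lines look like secrets, else 0."""
    added = "\n".join(
        line[1:] for line in staged_diff().splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )
    hits = find_secrets(added)
    if hits:
        print("Possible secrets in staged changes:", file=sys.stderr)
        for line in hits:
            print("  " + line.strip(), file=sys.stderr)
        return 1
    return 0

# In the hook file itself, finish with: sys.exit(main())
```

Git aborts the commit when a pre-commit hook exits with a non-zero status, which is why `main` returns 1 on a match.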
Data
- Use Data Version Control. DVC usually runs along with Git. Git is used as usual to store and version code (including DVC meta-files). DVC helps to store data and model files seamlessly out of Git, while preserving almost the same user experience as if they were stored in Git itself
- Read more at their site and here
ML
- Try to use Continuous Machine Learning (CML)
- Read a detailed guideline on ML best practices by Google
Notebooks
Write notebooks so that anyone can rerun them on the same inputs and produce the same outputs. A notebook should be executable from top to bottom and should contain the information required to set up the correct, consistent environment. Create templates for common tasks so that other team members can reuse them. Prefer JupyterLab over the classic Jupyter Notebook interface. Avoid using Google Colab unless it's absolutely necessary.
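One small piece of this, sketched below under simple assumptions, is a first notebook cell that fixes the random seed and records basic facts about the runtime. The names `SEED`, `set_seed`, and `environment_summary` are made up for this example, and only Python's built-in random module is seeded; any other library that draws random numbers would need its own seeding call.

```python
import platform
import random
import sys

SEED = 42  # arbitrary example value; any fixed integer works


def set_seed(seed=SEED):
    """Seed Python's built-in RNG so random draws repeat across runs."""
    random.seed(seed)
    # If NumPy or a deep learning framework is in use, seed it here as well.


def environment_summary():
    """Collect basic facts about the runtime to record alongside results."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": SEED,
    }


set_seed()
first = [random.random() for _ in range(3)]
set_seed()
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same draws
print(environment_summary())
```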
Summary
- Follow established software development best practices: OOP, style guides, documentation
- Put your notebooks under version control
- Reproducible Notebooks
- Continuous Integration (CI)
- Parameterized Notebooks
- Continuous Deployment (CD)
- Log all experiments automatically
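For the last point, automatic logging can be as simple as wrapping the training function. The sketch below is a standard-library illustration of the idea, not a replacement for a dedicated tracking tool: the decorator `track_experiment`, the file name `experiments.jsonl`, and the placeholder `train` function are all assumptions made for this example.

```python
import functools
import json
import time
from pathlib import Path

LOG_PATH = Path("experiments.jsonl")  # hypothetical log location


def track_experiment(func):
    """Append the keyword parameters and returned metrics of each call to LOG_PATH."""
    @functools.wraps(func)
    def wrapper(**params):
        started = time.time()
        metrics = func(**params)
        record = {
            "experiment": func.__name__,
            "params": params,  # only explicitly passed keyword arguments
            "metrics": metrics,
            "duration_s": round(time.time() - started, 3),
        }
        # One JSON object per line keeps the log easy to append to and parse.
        with LOG_PATH.open("a") as fh:
            fh.write(json.dumps(record) + "\n")
        return metrics
    return wrapper


@track_experiment
def train(learning_rate=0.1, epochs=5):
    # Placeholder "training" so the sketch runs without any ML library.
    loss = 1.0 / (1.0 + learning_rate * epochs)
    return {"loss": round(loss, 4)}


train(learning_rate=0.1, epochs=5)
train(learning_rate=0.5, epochs=5)
```

Because the wrapper only accepts keyword arguments, every logged record names its parameters explicitly, which makes runs easier to compare later.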
Notebook guidelines
- Organizing your code: Write classes and modules in separate files and import them into your notebooks. Keep your notebook clean and avoid long stretches of code in cells
- Variables: Create new variables rather than overwriting existing ones. Do not hard-code numerical constants, URL strings, etc. Use a Python global constant instead
- TDD: Write test cases for your modules. Read first here and then here
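The three guidelines above can be sketched together: a small module that lives outside the notebook, uses named constants, and ships with its own test cases. The module contents, the constant names, and the `split_rows` function are hypothetical and exist only for illustration.

```python
# Contents of a hypothetical module, e.g. features.py, imported by the notebook.
import unittest

# Named constants instead of hard-coded values scattered through cells.
TEST_FRACTION = 0.2
MIN_ROWS = 5


def split_rows(rows, test_fraction=TEST_FRACTION):
    """Split a list into (train, test), keeping order, with the last rows as test."""
    if len(rows) < MIN_ROWS:
        raise ValueError(f"Need at least {MIN_ROWS} rows, got {len(rows)}")
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


class SplitRowsTest(unittest.TestCase):
    def test_split_sizes(self):
        train, test = split_rows(list(range(10)))
        self.assertEqual(len(train), 8)
        self.assertEqual(len(test), 2)

    def test_too_few_rows(self):
        with self.assertRaises(ValueError):
            split_rows([1, 2])
```

The notebook would then contain little more than an import of `split_rows` and the call itself, while the tests run separately, for example with `python -m unittest`.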
Tracking Experiments
- Track experiments to record and compare parameters and results. You and your teammates need to keep track of experiments and document them
- Use a tool such as MLflow to package code in a reusable, reproducible form in order to share it with other data scientists or transfer it to production
- You can find detailed tutorials and examples at their site. However, here are a few more suggestions