Machine Learning Best Practices For Your Company

Jul 12, 2020

Machine learning best practices, guidelines, and tools for developers. They apply to Data Analysts, Data Engineers, Machine Learning Engineers, Data Scientists, and research teams in general.

Photo by Daniel Chekalov / Unsplash


Git and GitHub

If you're new to Git and GitHub/GitLab, watch this course from Udacity
Ensure your Git client is configured with the correct email address and linked to your GitHub/GitLab user
Use Git-based repositories, and push all code to the company's GitLab (or GitHub). Ask your manager for access to your respective groups so that you can create repositories and push your code
Don't push your code directly to the master branch. Use branches and tags.
Always send a Pull Request (Merge Request) to your senior developer. If you're working alone, send it to yourself.
Install and use GitHub Desktop for better code management and visibility. Install GitHub CLI and Hub CLI if you're a CLI pro
Read more here

.gitignore

Be sure to ignore trivial files and dependencies
Ignore large files such as images and caches, as well as private key files
If you're not sure what to ignore, use gitignore.io to help you create a .gitignore file

Commit Messages

You're not expected to follow everything mentioned in the links below, but rather to develop a habit of writing good commit messages

Secret Keys

Never, ever commit API keys, secret keys, tokens, URLs or passwords in any file.
Read more here and here
Use .env files and read the keys from environment variables. How you do this depends on the language and tools you use, e.g. Python, Node or Docker
Exclude the .env file from commits by adding .env to .gitignore. You can also commit an example configuration, .env.sample, with dummy data or blanks to show the schema your application requires
If you commit a secret key by mistake, notify your senior developer or manager as soon as possible. Read more on the removal of sensitive data here
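
As a rough illustration of reading keys from environment variables in Python, here is a minimal standard-library sketch. The helper names `load_env_file` and `require_env` are hypothetical and exist only for this example; in practice a dedicated package such as python-dotenv is a common choice, and this parser handles only simple KEY=VALUE lines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration. Variables that are already set
    in the real environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key:
            # Remove surrounding whitespace and optional quotes from the value.
            os.environ.setdefault(key, value.strip().strip("'\""))


def require_env(name):
    """Return the value of an environment variable, or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

A script would typically call `load_env_file()` once at startup and then `require_env("API_KEY")` wherever the key is needed, where `API_KEY` is a placeholder name. Failing loudly on a missing variable tends to surface configuration mistakes earlier than a silent `None`.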

README.md

Be sure to include a README.md file in every repository you create
Find best practices here and try to incorporate whichever suits your work

Githooks

Git hooks are scripts that Git executes before or after events such as commit, push, and receive. Check out Githooks
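
A hook can be any executable script, so it can be written in Python. The sketch below shows the core of a possible pre-commit check that looks for text resembling the secret keys discussed earlier. The patterns and the function names `find_secrets` and `check_files` are illustrative assumptions, not a complete or authoritative secret scanner.

```python
import re

# Illustrative patterns only; a real project would tune and extend these.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hard-coded password": re.compile(r"(?i)password\s*=\s*['\"][^'\"]+['\"]"),
}


def find_secrets(text):
    """Return (line_number, label) pairs for lines that look like secrets."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings


def check_files(paths):
    """Print every finding and return 1 if any file matched, else 0."""
    exit_code = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for line_number, label in find_secrets(handle.read()):
                print(f"{path}:{line_number}: possible {label}")
                exit_code = 1
    return exit_code
```

To use something like this as a hook, the script would live at `.git/hooks/pre-commit`, be marked executable, pass the staged file names (for example from `git diff --cached --name-only`) to `check_files`, and exit with its return value. Git aborts the commit when a pre-commit hook exits with a non-zero status.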

Data

Use Data Version Control (DVC). DVC usually runs alongside Git. Git is used as usual to store and version code (including DVC meta-files). DVC stores data and model files seamlessly outside of Git, while preserving almost the same user experience as if they were stored in Git itself
Read more at their site and here
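
To give a feel for the split between a small meta-file in Git and the data stored elsewhere, here is a toy sketch that writes a pointer file containing a content hash of a data file. This is not DVC's file format or API; the names `file_md5` and `write_pointer` and the `.pointer.json` suffix are made up for illustration, and DVC handles all of this for you.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small JSON pointer next to the data file and return its path.

    The pointer is tiny and can be committed to Git, while the data file
    itself is ignored by Git and stored somewhere else.
    """
    data_path = Path(data_path)
    pointer = {
        "path": data_path.name,
        "md5": file_md5(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path
```

Because the pointer changes whenever the data changes, the Git history of the pointer doubles as a version history of the data.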

ML

Try to use Continuous Machine Learning (CML)
Read a detailed guideline on ML best practices by Google

Notebooks

Write notebooks in such a way that anyone can rerun them on the same inputs and produce the same outputs. Your notebook should be executable from top to bottom and should contain the information required to set up the correct, consistent environment. Create templates for common tasks so that other team members can use them. Also, use JupyterLab instead of the traditional Jupyter Notebook. Avoid using Google Colab unless it's absolutely necessary.
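
One way to work toward this is a first notebook cell that fixes random seeds and records the environment. The following is a minimal sketch using only the standard library; `set_seed` and `describe_environment` are hypothetical helper names, and a real project would also seed NumPy or whichever ML framework it uses.

```python
import platform
import random
import sys
from importlib import metadata


def set_seed(seed=42):
    """Seed Python's random module so reruns draw the same numbers."""
    random.seed(seed)
    return seed


def describe_environment(packages=()):
    """Collect basic facts a reader needs to recreate the environment."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info
```

Printing `describe_environment(["pandas"])` at the top of a notebook (the package name is just an example) leaves a record of the versions the results were produced with.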

Summary

Follow established software development best practices: OOP, style guides, documentation
You should institute version control for your Notebooks
Reproducible Notebooks
Continuous Integration (CI)
Parameterized Notebooks
Continuous Deployment (CD)
Log all experiments automatically

Notebook guidelines

Organizing your code: Write classes and modules in separate files and import these into your notebooks. Keep your notebook clean and do not write too many lines of code
Variables: Re-create new variables. Do not hard-code numerical constants, URL strings, etc. Use a Python global constant instead
TDD: Write test cases for your modules. Read first here and then here
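
The three guidelines above can be sketched together as follows. The constants, the function `split_rows`, and the test are hypothetical examples; in a real project the function would live in its own module file, the test in a separate test file, and the notebook would only import and call the function.

```python
import random

# Global constants instead of numbers hard-coded throughout the notebook.
TEST_FRACTION = 0.2
RANDOM_SEED = 42


def split_rows(rows, test_fraction=TEST_FRACTION, seed=RANDOM_SEED):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def test_split_rows():
    """A test in the style a runner such as pytest would collect."""
    rows = list(range(10))
    train, test = split_rows(rows)
    assert len(train) == 8 and len(test) == 2
    assert sorted(train + test) == rows      # nothing lost or duplicated
    assert split_rows(rows) == (train, test)  # same seed, same split
```

Keeping the logic in a plain function makes it testable outside the notebook, and changing `TEST_FRACTION` in one place updates every caller.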

Tracking Experiments

Track experiments to record and compare parameters and results. You and your teammates need to keep track of experiments and document them
Use a tool such as MLflow to package code in a reusable, reproducible form so that you can share it with other data scientists or transfer it to production
You can find detailed tutorials and examples at their site. However, here are a few more suggestions
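
To make concrete what an experiment tracker records, here is a hand-rolled sketch that appends each run's parameters and metrics to a JSON Lines file. This is not MLflow's API; the function names and the `experiments.jsonl` file name are assumptions for illustration, and a dedicated tool adds much more (a UI, artifact storage, model packaging).

```python
import json
import time
from pathlib import Path


def log_experiment(params, metrics, log_path="experiments.jsonl"):
    """Append one run (parameters, metrics, timestamp) to a JSON Lines file."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def load_experiments(log_path="experiments.jsonl"):
    """Read all logged runs back as a list of dictionaries."""
    path = Path(log_path)
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines() if line]


def best_run(runs, metric, higher_is_better=True):
    """Return the run with the best value for the given metric, or None."""
    scored = [run for run in runs if metric in run["metrics"]]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda run: run["metrics"][metric])
```

Calling `log_experiment` at the end of every training run, rather than copying numbers by hand, is what makes the later comparison with `best_run` trustworthy.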

ML Ops
