Version control for data science & machine learning with DVC

Marie-Hélène Burle

December 12, 2023


On version control

I won’t cover here the benefits of using a good version control system such as Git

On the benefits of VCS

Extending Git for data

While Git is a wonderful tool for versioning text files (code, writing in markup formats), it isn’t a tool to manage changes to datasets

Several open source tools, each with a different design and approach, extend Git capabilities to track data: Git LFS, git-annex, lakeFS, Dolt, DataLad

Extending Git for models and experiments

Reproducible research and collaboration on data science and machine learning projects involve more than dataset management:

Experiments and the models they produce also need to be tracked

Many moving parts

*hp = hyperparameter

[Diagram: three datasets (data1–data3) and three hyperparameter sets (hp1–hp3) each feed into three models (model1–model3), yielding 27 performance results (performance1 … performance27)]

How did we get performance17 again? 🤯

Enter DVC

DVC principles

Large files (datasets, models…) are kept outside Git
Each large file or directory put under DVC tracking has an associated .dvc file
Git only tracks the .dvc files (metadata)

Workflows can be tracked for collaboration and reproducibility

DVC functions like a Makefile and allows you to rerun only what is necessary

Installation

For Linux (other OSes, refer to the doc):

  • pip:

    pip install dvc
  • conda:

    conda install -c conda-forge dvc

  • pipx (if you want dvc available everywhere without having to activate virtual envs):

    pipx install dvc

Optional dependencies ([s3], [gdrive], etc.) add support for different types of remote storage
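
For example, to add S3 support when installing with pip:

pip install "dvc[s3]"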

How to run

  • Terminal

    dvc ...
  • VS Code extension

  • Python library if installed via pip or conda

    import dvc.api

In this webinar, I will use DVC through the command line

Acknowledgements

Code and data for this webinar modified from:

The project

tree -L 3
├── LICENSE
├── data
│   ├── prepared
│   └── raw
│       ├── train
│       └── val
├── metrics
├── model
├── requirements.txt
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py

Initialize Git repo

git init
Initialized empty Git repository in dvc/.git/

This creates the .git directory

git status
On branch main

No commits yet

Untracked files:
    LICENSE
    data/
    requirements.txt
    src/

Initialize DVC project

dvc init
Initialized DVC repository.

You can now commit the changes to git.

You will also see a note about usage analytics collection and info on how to opt out

A .dvc directory and a .dvcignore file were created

Commit DVC system files

DVC automatically staged its system files for us:

git status
On branch main

No commits yet

Changes to be committed:
    new file:   .dvc/.gitignore
    new file:   .dvc/config
    new file:   .dvcignore

Untracked files:
    LICENSE
    data/
    requirements.txt
    src/

So we can directly commit:

git commit -m "Initialize DVC"

Prepare repo

Let’s work in a virtual environment:

# Create venv and add to .gitignore
python -m venv venv && echo venv > .gitignore

# Activate venv
source venv/bin/activate

# Update pip
python -m pip install --upgrade pip

# Install packages needed
python -m pip install -r requirements.txt

Clean working tree

git add .gitignore LICENSE requirements.txt
git commit -m "Add general files"
git add src
git commit -m "Add scripts"
git status
On branch main
Untracked files:
    data/

Now, it is time to deal with the data

Tracking data with DVC

Put data under DVC tracking

We are still not tracking any data:

dvc status
There are no data or pipelines tracked in this project yet.

You can choose what to track as a unit (e.g. each picture individually, or the whole data directory)

Let’s break it down by set:

dvc add data/raw/train
dvc add data/raw/val

This adds the data to .dvc/cache/files and creates 3 files in data/raw:

  • .gitignore
  • train.dvc
  • val.dvc

The .gitignore tells Git not to track the data:

cat data/raw/.gitignore
/train
/val

The .dvc files contain the metadata for the cached directories
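
For example, a .dvc file for a tracked directory looks roughly like this (the hash and sizes below are made up):

cat data/raw/train.dvc
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  size: 23456789
  nfiles: 1000
  path: train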

Tracked data

We are all good:

dvc status
Data and pipelines are up to date.

Data (de)duplication

Link between checked-out version of a file/directory and the cache:

Cache ⟷ working directory   Duplication        Editable
Reflinks*                   Only when needed   Yes
Hardlinks/Symlinks          No                 No
Copies                      Yes                Yes

*Reflinks only available for a few file systems (Btrfs, XFS, OCFS2, or APFS)
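
The link type can be configured; for instance, to have DVC try reflinks first and fall back to copies:

dvc config cache.type reflink,copy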

Commit the metafiles

The metafiles should be put under Git version control

You can configure DVC to automatically stage its newly created system files:

dvc config [--system] [--global] core.autostage true

You can then commit directly:

git commit -m "Initial version of data"
git status
On branch main
nothing to commit, working tree clean

Track changes to the data

Let’s make some changes to the data:

rm data/raw/val/n03445777/ILSVRC2012_val*

Remember that Git is not tracking the data:

git status
On branch main
nothing to commit, working tree clean

But DVC is:

dvc status
data/raw/val.dvc:
    changed outs:
            modified:           data/raw/val

Add changes to DVC

dvc add data/raw/val
dvc status
Data and pipelines are up to date.

Now we need to commit the modified .dvc file to Git:

git status
On branch main
Changes to be committed:
    modified:   data/raw/val.dvc

Staging happened automatically because I have set the autostage option to true on my system

git commit -m "Delete data/raw/val/n03445777/ILSVRC2012_val*"

Check out older versions

What if we want to go back to the 1st version of our data?

For this, we first use Git to check out the proper commit, then run dvc checkout to have the data catch up to the .dvc file

To avoid forgetting to run the commands that will make DVC catch up to Git, we can automate this process by installing Git hooks:

dvc install
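
This sets up three hooks: post-checkout (runs dvc checkout after Git checkouts), pre-commit (runs dvc status before commits), and pre-push (runs dvc push before Git pushes):

ls .git/hooks
post-checkout  pre-commit  pre-push  ...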

Now, all we have to do is check out the commit we want:

git log --oneline
94b520b (HEAD -> main) Delete data/raw/val/n03445777/ILSVRC2012_val*
92837a6 Initial version of data
dd961c6 Add scripts
db9c14e Add general files
7e08586 Initialize DVC
git checkout 92837a6

The version of the data in the working directory was automatically switched to match the .dvc file:

dvc status
Data and pipelines are up to date.

You can look at your files to verify that the deleted files are back
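
For example (the count below is hypothetical):

ls data/raw/val/n03445777 | wc -l
50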

Git workflows

git checkout is fine for having a look, but a detached HEAD is not a good place to create new commits

Let’s create a new branch and switch to it:

git switch -c alternative
Switched to a new branch 'alternative'

Going back and forth between both versions of our data is now as simple as switching branches:

git switch main
git switch alternative

Collaboration

Classic workflow

The Git project (including the .dvc files) goes to a Git remote (GitHub/GitLab/Bitbucket/server)

The data goes to a DVC remote (AWS/Azure/Google Drive/server/etc.)

DVC remotes

DVC can use many cloud storage services, as well as remote machines/servers via SSH, WebDAV, etc.
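
For example (the bucket and host names below are made up):

# An S3 bucket (requires the [s3] optional dependency)
dvc remote add -d myremote s3://mybucket/dvcstore

# A server accessed over SSH
dvc remote add -d myremote ssh://user@example.com/home/user/dvcstore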

Let’s create a local remote here:

# Create a directory outside the project
mkdir ../remote

# Setup default (-d) remote
dvc remote add -d local_remote ../remote
Setting 'local_remote' as a default remote.
cat .dvc/config
[core]
    remote = local_remote
['remote "local_remote"']
    url = ../../remote

Commit remote config

The new remote configuration should be committed:

git status
On branch alternative

Changes not staged for commit:
    modified:   .dvc/config
git add .
git commit -m "Config remote"

Push to remotes

Let’s push the data from the cache (.dvc/cache) to the remote:

dvc push
2702 files pushed

With Git hooks installed, dvc push runs automatically whenever you run git push

(But the data is pushed to the DVC remote while the files tracked by Git get pushed to the Git remote)

By default, the entire data cache gets pushed to the remote, but there are many options

Example: only push the data corresponding to a certain .dvc file:

dvc push data/raw/val.dvc

Pull from remotes

dvc fetch downloads data from the remote into the cache. To have it also update the working directory, follow it with dvc checkout

You can run these two commands in one step with dvc pull
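
That is:

# Two steps: download into the cache, then update the working directory
dvc fetch
dvc checkout

# Or in one step
dvc pull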

Tracking experiments

DVC pipelines

DVC pipelines create reproducible workflows and are functionally similar to Makefiles

Each step in a pipeline is created with dvc stage add and adds an entry to the dvc.yaml file

dvc stage add options:

-n: name of the stage
-d: dependency
-o: output
-M: metrics file (not cached by DVC)

Each stage contains:

  • cmd: the command executed
  • deps: the dependencies
  • outs: the outputs

The file is then used to visualize the pipeline and run it

Example

Let’s create a pipeline to run a classifier on our data

The pipeline contains 3 steps:

  • prepare
  • train
  • evaluate

Create a pipeline

1st stage (data preparation):

dvc stage add -n prepare -d src/prepare.py -d data/raw \
    -o data/prepared/train.csv -o data/prepared/test.csv \
    python src/prepare.py
Added stage 'prepare' in 'dvc.yaml'

2nd stage (training):

dvc stage add -n train -d src/train.py -d data/prepared/train.csv \
    -o model/model.joblib \
    python src/train.py
Added stage 'train' in 'dvc.yaml'

3rd stage (evaluation):

dvc stage add -n evaluate -d src/evaluate.py -d model/model.joblib \
    -M metrics/accuracy.json \
    python src/evaluate.py
Added stage 'evaluate' in 'dvc.yaml'
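
The three dvc stage add commands generated a dvc.yaml that should look roughly like this:

cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - src/prepare.py
    - data/raw
    outs:
    - data/prepared/train.csv
    - data/prepared/test.csv
  train:
    cmd: python src/train.py
    deps:
    - src/train.py
    - data/prepared/train.csv
    outs:
    - model/model.joblib
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - src/evaluate.py
    - model/model.joblib
    metrics:
    - metrics/accuracy.json:
        cache: false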

Commit pipeline

git commit -m "Define pipeline"
prepare:
    changed deps:
            modified:           data/raw
            modified:           src/prepare.py
    changed outs:
            deleted:            data/prepared/test.csv
            deleted:            data/prepared/train.csv
train:
    changed deps:
            deleted:            data/prepared/train.csv
            modified:           src/train.py
    changed outs:
            deleted:            model/model.joblib
evaluate:
    changed deps:
            deleted:            model/model.joblib
            modified:           src/evaluate.py
    changed outs:
            deleted:            metrics/accuracy.json
[alternative 4aa331b] Define pipeline
 3 files changed, 27 insertions(+)
 create mode 100644 data/prepared/.gitignore
 create mode 100644 dvc.yaml
 create mode 100644 model/.gitignore

Visualize pipeline in a DAG

dvc dag
+--------------------+         +------------------+
| data/raw/train.dvc |         | data/raw/val.dvc |
+--------------------+         +------------------+
                  ***           ***
                     **       **
                       **   **
                    +---------+
                    | prepare |
                    +---------+
                          *
                          *
                          *
                      +-------+
                      | train |
                      +-------+
                          *
                          *
                          *
                    +----------+
                    | evaluate |
                    +----------+

Run pipeline

dvc repro
'data/raw/train.dvc' didn't change, skipping
'data/raw/val.dvc' didn't change, skipping
Running stage 'prepare':
> python src/prepare.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.

dvc repro breakdown

  • dvc repro runs the dvc.yaml file in a Makefile fashion

  • First, it looks at the dependencies: the data didn’t change

  • Then it runs the commands to produce the outputs (since this was our first run, there were no outputs yet)

  • When the 1st stage is run, a dvc.lock is created with information on that part of the run

  • When the 2nd and 3rd stages are run, dvc.lock is updated

    At the end of the run, dvc.lock contains all the info about the run we just did (version of the data used, etc.); see the sketch after this list

  • A new directory called runs is created in .dvc/cache with cached data for this run
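
A sketch of the kind of information dvc.lock records (the hash and sizes below are made up):

head -n 9 dvc.lock
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - path: data/raw
      md5: 5ea8c42c74b1d7e9046fdad9d0826496.dir
      size: 23456789
      nfiles: 1400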

Results of the run

  • The prepared data was created in data/prepared (with a .gitignore to exclude it from Git: you don’t want to track results in Git, but rather the scripts that can reproduce them)

  • A model was saved in model (with another .gitignore file)

  • The accuracy of this run was saved in metrics

Clean working tree

Now, we definitely want to create a commit with the dvc.lock

We could add the metrics resulting from this run in the same commit:

git add metrics
git commit -m "First pipeline run and results"

Our working tree is now clean and our data/pipeline up to date:

git status
On branch alternative
nothing to commit, working tree clean
dvc status
Data and pipelines are up to date.

Modify pipeline

From now on, if we edit one of the scripts or one of the dependencies, dvc status will tell us what changed, and dvc repro will rerun only the parts of the pipeline needed to update the results, pretty much as a Makefile would
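
For example, after editing src/train.py only, a new run would look something like this (illustrative output):

dvc repro
'data/raw/train.dvc' didn't change, skipping
'data/raw/val.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'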

Going further … next time


DVC is a sophisticated tool with many additional features:

  • Creation of data registries

  • DVCLive

    A Python library to log experiment metrics

  • Visualize the performance logs as plots

  • Continuous integration

    With the sister project CML (Continuous Machine Learning)