Marie-Hélène Burle
December 12, 2023
I won’t introduce here the benefits of using a good version control system such as Git
While Git is a wonderful tool for versioning text files (code, writing in markup formats), it isn’t a tool for managing changes to datasets
Several open source tools—each with a different structure and functioning—extend Git capabilities to track data: Git LFS, git-annex, lakeFS, Dolt, DataLad
Reproducible research and collaboration on data science and machine learning projects involve more than dataset management:
Experiments and the models they produce also need to be tracked
[Figure: experiments with many hyperparameter (hp) combinations and their resulting performances: “How did we get performance 17 again?” 🤯]
Large files (datasets, models…) are kept outside Git
Each large file or directory put under DVC tracking has an associated .dvc file
Git only tracks the .dvc files (metadata)
Workflows can be tracked for collaboration and reproducibility
DVC functions as a Makefile and allows rerunning only what is necessary
For Linux (for other OSes, refer to the doc), typical commands are sketched below:
pip
conda
pipx (if you want dvc available everywhere without having to activate virtual envs)
Optional dependencies ([s3], [gdrive], etc.) for remote storage
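A minimal sketch of these installation methods (all standard, documented DVC install commands):

# with pip (add extras for remote storage as needed, e.g. 'dvc[s3]'):
pip install dvc
# with conda:
conda install -c conda-forge dvc
# with pipx, to make dvc available outside virtual environments:
pipx install dvc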
DVC can be used from the terminal, or as a Python library if installed via pip or conda
In this webinar, I will use DVC through the command line
Code and data for this webinar modified from:
├── LICENSE
├── data
│ ├── prepared
│ └── raw
│ ├── train
│ └── val
├── metrics
├── model
├── requirements.txt
└── src
├── evaluate.py
├── prepare.py
└── train.py
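The outputs below come from initializing Git and then DVC at the root of the project; a minimal sketch of the commands (the project directory is called dvc, matching the output):

# inside the project directory:
git init
dvc init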
Initialized empty Git repository in dvc/.git/
This creates the .git directory
Initialized DVC repository.
You can now commit the changes to git.
You will also see a note about usage analytics collection and info on how to opt out
A .dvc directory and a .dvcignore file were created
DVC automatically staged its system files for us:
On branch main
No commits yet
Changes to be committed:
new file: .dvc/.gitignore
new file: .dvc/config
new file: .dvcignore
Untracked files:
LICENSE
data/
requirements.txt
src/
So we can directly commit:
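A sketch of the commit (the message matches the Git log shown later):

git commit -m "Initialize DVC"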
Let’s work in a virtual environment:
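A typical setup (the exact commands are an assumption; requirements.txt comes from the project):

python -m venv venv                # create the virtual env
source venv/bin/activate           # activate it
pip install -r requirements.txt    # install the project dependencies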
git add .gitignore LICENSE requirements.txt
git commit -m "Add general files"
git add src
git commit -m "Add scripts"
On branch main
Untracked files:
data/
Now, it is time to deal with the data
We are still not tracking any data:
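The message below is what DVC’s status check reports at this point:

dvc status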
There are no data or pipelines tracked in this project yet.
You can choose what to track as a unit (e.g. each picture individually, or the whole data directory as a single unit)
Let’s break it down by set:
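A sketch of the commands, matching the train.dvc and val.dvc files created below:

dvc add data/raw/train
dvc add data/raw/val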
This adds the data to .dvc/cache/files and creates 3 files in data/raw:
.gitignore
train.dvc
val.dvc
The .gitignore tells Git not to track the data:
/train
/val
The .dvc files contain the metadata for the cached directories
We are all good:
Data and pipelines are up to date.
Link between checked-out version of a file/directory and the cache:
| Link type | Duplication | Editable |
|---|---|---|
| Reflinks* | Only when needed | Yes |
| Hardlinks/Symlinks | No | No |
| Copies | Yes | Yes |
*Reflinks only available for a few file systems (Btrfs, XFS, OCFS2, or APFS)
The metafiles should be put under Git version control
You can configure DVC to automatically stage its newly created system files:
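This is done with a DVC config option:

dvc config core.autostage true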
You can then commit directly:
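A sketch (the message matches the “Initial version of data” commit in the log shown later):

git commit -m "Initial version of data"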
On branch main
nothing to commit, working tree clean
Let’s make some change to the data:
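One way to make such a change (the deleted path matches the commit log shown later; the exact commands are an assumption):

# delete part of the validation set:
rm data/raw/val/n03445777/ILSVRC2012_val*
# update the DVC tracking information (autostage stages val.dvc for us):
dvc add data/raw/val
git commit -m "Delete data/raw/val/n03445777/ILSVRC2012_val*"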
Remember that Git is not tracking the data:
Data and pipelines are up to date.
What if we want to go back to the 1st version of our data?
For this, we first use Git to checkout the proper commit, then run dvc checkout to have the data catch up to the .dvc file
To avoid forgetting to run the commands that will make DVC catch up to Git, we can automate this process by installing Git hooks:
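The hooks are installed with:

dvc install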
Now, all we have to do is check out the commit we want:
94b520b (HEAD -> main) Delete data/raw/val/n03445777/ILSVRC2012_val*
92837a6 Initial version of data
dd961c6 Add scripts
db9c14e Initialize repo
7e08586 Initialize DVC
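For instance, to go back to the initial version of the data (hash taken from the log above):

git checkout 92837a6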
The version of the data in the working directory got automatically switched to match the .dvc file:
Data and pipelines are up to date.
You can look at your files to verify that the deleted files are back
git checkout is fine for having a look, but a detached HEAD is not a good place to create new commits
Let’s create a new branch and switch to it:
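For instance (git switch -c alternative would work too):

git checkout -b alternative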
Switched to a new branch 'alternative'
Going back and forth between both versions of our data is now as simple as switching branches:
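A sketch, assuming the branches created above:

git checkout main          # version with the deleted files
git checkout alternative   # initial version of the data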
The Git project (including the .dvc files) goes to a Git remote (GitHub/GitLab/Bitbucket/server)
The data go to a DVC remote (AWS/Azure/Google Drive/server/etc.)
DVC can use many cloud storage services, or remote machines/servers via SSH, WebDAV, etc.
Let’s create a local remote here:
# Create a directory outside the project
mkdir ../remote
# Setup default (-d) remote
dvc remote add -d local_remote ../remote
Setting 'local_remote' as a default remote.
The remote is recorded in .dvc/config (the URL is stored relative to the .dvc directory):
[core]
    remote = local_remote
['remote "local_remote"']
    url = ../../remote
The new remote configuration should be committed:
On branch alternative
Changes not staged for commit:
modified: .dvc/config
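A sketch of the commit (the message is an assumption):

git add .dvc/config
git commit -m "Add local remote"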
Let’s push the data from the cache (.dvc/cache) to the remote:
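This is done with:

dvc push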
2702 files pushed
With Git hooks installed, dvc push is automatically run when you git push
(but the data is pushed to the DVC remote, while the files tracked by Git get pushed to the Git remote)
By default, the entire data cache gets pushed to the remote, but there are many options
dvc fetch downloads data from the remote into the cache. To have it update the working directory, follow it with dvc checkout
You can do these 2 commands at the same time with dvc pull
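A sketch of the two equivalent options:

# option 1: download into the cache, then update the working directory:
dvc fetch
dvc checkout
# option 2: do both at once:
dvc pull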
DVC pipelines create reproducible workflows and are functionally similar to Makefiles
Each step in a pipeline is created with dvc stage add and adds an entry to a dvc.yaml file
dvc stage add options:
-n : name of the stage
-d : dependency
-o : output
Each stage contains:
cmd : the command executed
deps : the dependencies
outs : the outputs
The file is then used to visualize the pipeline and run it
Let’s create a pipeline to run a classifier on our data
The pipeline contains 3 steps:
1st stage (data preparation):
dvc stage add -n prepare -d src/prepare.py -d data/raw \
-o data/prepared/train.csv -o data/prepared/test.csv \
python src/prepare.py
Added stage 'prepare' in 'dvc.yaml'
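The other two stages can be added in the same way. A sketch: the dependencies and outputs are inferred from the dvc status output below and may not be exhaustive:

dvc stage add -n train -d src/train.py -d data/prepared/train.csv \
              -o model/model.joblib \
              python src/train.py

# -M marks a metrics file that stays tracked by Git instead of the DVC cache
dvc stage add -n evaluate -d src/evaluate.py -d model/model.joblib \
              -M metrics/accuracy.json \
              python src/evaluate.py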
prepare:
changed deps:
modified: data/raw
modified: src/prepare.py
changed outs:
deleted: data/prepared/test.csv
deleted: data/prepared/train.csv
train:
changed deps:
deleted: data/prepared/train.csv
modified: src/train.py
changed outs:
deleted: model/model.joblib
evaluate:
changed deps:
deleted: model/model.joblib
modified: src/evaluate.py
changed outs:
deleted: metrics/accuracy.json
[main 4aa331b] Define pipeline
3 files changed, 27 insertions(+)
create mode 100644 data/prepared/.gitignore
create mode 100644 dvc.yaml
create mode 100644 model/.gitignore
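The pipeline can be visualized with:

dvc dag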
+--------------------+ +------------------+
| data/raw/train.dvc | | data/raw/val.dvc |
+--------------------+ +------------------+
*** ***
** **
** **
+---------+
| prepare |
+---------+
*
*
*
+-------+
| train |
+-------+
*
*
*
+----------+
| evaluate |
+----------+
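And run with:

dvc repro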
'data/raw/train.dvc' didn't change, skipping
'data/raw/val.dvc' didn't change, skipping
Running stage 'prepare':
> python src/prepare.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
dvc repro runs the dvc.yaml file in a Makefile fashion
First, it looked at the dependencies: the data didn’t change
Then it ran the commands to produce the outputs (since this was our first run, there were no outputs yet)
When the 1st stage is run, a dvc.lock file is created with information on that part of the run
When the 2nd and 3rd stages are run, dvc.lock is updated
At the end of the run, dvc.lock contains all the info about the run we just did (version of the data used, etc.)
A new directory called runs is created in .dvc/cache with cached data for this run
The prepared data was created in data/prepared (with a .gitignore to exclude it from Git—you don’t want to track results in Git, but the scripts that can reproduce them)
A model was saved in model (with another .gitignore file)
The accuracy of this run was saved in metrics
Now, we definitely want to create a commit with the dvc.lock file
We could add the metrics resulting from this run in the same commit:
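A sketch (the commit message is an assumption):

git add dvc.lock metrics/accuracy.json
git commit -m "First pipeline run"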
From now on, if we edit one of the scripts or one of the dependencies, dvc status will tell us what changed and dvc repro will only rerun the parts of the pipeline needed to update the results, pretty much as a Makefile would
DVC is a sophisticated tool with many additional features: