Discord gems from the April Heartbeat:
- It supports Windows, Mac, and Linux, on both Python 2 and 3.
- No specific CPU or RAM requirements: it’s a lightweight command line tool and should be able to run pretty much everywhere you can run Python.
- It depends on a few Python libraries that it installs as dependencies (they are specified in the requirements.txt).
- It does not depend on Git and theoretically could be run without any SCM. Running it on top of a Git repository, however, is recommended and gives you the ability to actually save the history of datasets, models, etc. (even though it does not put them into Git directly).
- No server licenses for DVC. It is 100% free and open source.
- What is the storage limit when using DVC? I am trying to version control datasets and models of >10 GB (potentially even bigger). Can DVC handle this?
There is no limit, none enforced by DVC itself. It depends on the size of your local or remote storage: you need enough space available on S3, your SSH server, or whatever other storage you are using to keep the data files, models, and all their versions you would like to store.
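In practice, versioning a multi-gigabyte file looks exactly like versioning a small one (a sketch; model.pkl and the s3://mybucket/dvc-storage remote are hypothetical names):

```shell
# Track a large model file with DVC; the file size does not matter to DVC itself
dvc add model.pkl

# Configure any remote with enough free space, e.g. an S3 bucket
dvc remote add -d myremote s3://mybucket/dvc-storage

# Upload the cached copy to the remote
dvc push
```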
- How does DVC know the sequence of stages to run? How does it connect them? Does it see that there is a dependency which is output by the first run?
DVC figures out the pipeline by looking at the dependencies and outputs of the stages. For example, having the following:
$ dvc run -f download.dvc \
-o joke.txt \
"curl https://geek-jokes.sameerkumar.website/api > joke.txt"
$ dvc run -f duplicate.dvc \
-d joke.txt \
-o duplicate.txt \
"cat joke.txt joke.txt > duplicate.txt"
you will end up with two stages: download.dvc and duplicate.dvc. The download one has joke.txt as an output. The duplicate one defines joke.txt as a dependency, as it is the same file. DVC detects that and creates a pipeline by joining those stages.
You can inspect the content of each stage file (they are human-readable).
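Given those two stage files, reproducing the last stage is enough to run the whole chain in order (a sketch, assuming the commands above were run in the same repository):

```shell
# DVC walks the dependency graph backwards from duplicate.dvc:
# joke.txt is an output of download.dvc, so that stage runs first if needed
dvc repro duplicate.dvc
```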
-
- Is it possible to use the same data of a remote in two different repositories? (e.g. in one repo run dvc pull -r my_remote to pull some data; running the same command in a different Git repo should also pull the same data)
Yes! It’s a frequent scenario for multiple repos to share remotes and even a local cache. A DVC file serves as a link to the actual data. If you add the same DVC file (e.g. data.dvc) to the new repo and do dvc pull -r remotename data.dvc, it will fetch the data. You have to use dvc remote add first in every project to specify the coordinates of the remote storage you would like to share. Alternatively (check out the question below), you could use --global to specify a single default remote (and/or cache dir) per machine.
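The sharing setup can be sketched like this (s3://mybucket/dvc-storage is a hypothetical remote URL):

```shell
# In repository A: register the shared remote and push the data
dvc remote add my_remote s3://mybucket/dvc-storage
dvc push -r my_remote

# In repository B: add the same data.dvc file via Git, register the
# same remote, then pull the identical data
dvc remote add my_remote s3://mybucket/dvc-storage
dvc pull -r my_remote data.dvc
```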
Use --global when you specify the remote settings. The remote will then be visible for all projects on the same machine. --global saves the remote configuration to the global config (e.g. ~/.config/dvc/config) instead of the per-project one (.dvc/config). See the dvc config documentation for more details.
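For example (the bucket name is hypothetical):

```shell
# Written to ~/.config/dvc/config, so every project on this machine sees it
dvc remote add --global myremote s3://mybucket/dvc-storage

# Without --global, the same command writes to the project's .dvc/config
dvc remote add myremote s3://mybucket/dvc-storage
```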
We would recommend skimming through our Get Started tutorial. To summarize the data versioning process of DVC:
- You create stage (aka DVC) files by adding or importing files (dvc add / dvc import), or by running a command to generate files (dvc run --out file.csv "wget https://example.com/file.csv").
- These stage files are tracked by Git.
- You use Git to retrieve previous stage files (e.g. git checkout v1.0).
- Then you use dvc checkout to retrieve all the files referenced by those stage files.
All your files (with each different version) are stored in a .dvc/cache directory, which you sync with a remote file storage (let’s say an S3 bucket) using the dvc push and dvc pull commands (analogous to git push / git pull, but instead of syncing your .git directory, you are syncing your .dvc cache).
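The steps above can be sketched as one end-to-end session (the file names and the v1.0 tag are hypothetical):

```shell
# 1. Track a data file; this creates the stage file file.csv.dvc
dvc add file.csv

# 2. The stage file (not the data itself) is committed to Git
git add file.csv.dvc .gitignore
git commit -m "Track file.csv with DVC"

# 3. Sync the cached data with remote storage
dvc push

# 4. Later: retrieve an old version of the stage file, then its data
git checkout v1.0
dvc checkout
```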
If you need to move your DVC file somewhere, it is pretty easy, even if done manually:
$ mv my.dvc data/my.dvc
# and now open my.dvc with your favorite editor and change wdir in it to 'wdir: ../'.
This is expected behaviour. DVC saves files under names derived from their checksums in order to prevent duplication. If you delete a “pushed” file in your project directory and perform dvc pull, DVC will take care of pulling the file and restoring it under its original name.
Below are some details about how DVC’s cache works, just to illustrate the logic. When you add a data source:
$ echo "foo" > data.txt
$ dvc add data.txt
It computes the (md5) checksum of the file and generates a DVC file with related information:
md5: 3bccbf004063977442029334c3448687
outs:
- cache: true
md5: d3b07384d113edec49eaa6238ad5ff00
metric: false
path: data.txt
wdir: ..
The original file is moved to the cache, and a link or copy (depending on your filesystem) is created to replace it in your workspace:
.dvc/cache
└── d3
└── b07384d113edec49eaa6238ad5ff00
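You can verify this mapping yourself with standard tools: the first two characters of the checksum become a directory name and the rest becomes the file name (a sketch using md5sum and bash substring expansion):

```shell
# md5 of "foo" plus the trailing newline added by echo
echo "foo" > data.txt
md5=$(md5sum data.txt | cut -d' ' -f1)
echo "$md5"   # d3b07384d113edec49eaa6238ad5ff00

# Reconstruct the cache path DVC would use for this file
echo ".dvc/cache/${md5:0:2}/${md5:2}"
# .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00
```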
Absolutely! There are three ways you could interact with DVC:
- Use subprocess to launch DVC;
- Use from dvc.main import main and call it with regular CLI logic, like ret = main(['add', 'foo']);
- Use our internal API (see dvc/repo and dvc/command in our source code to get a grasp of it). It is not officially public yet, and we don’t have any special docs for it, but it is fairly stable and could definitely be used for a POC. We’ll add docs and all the official stuff for it in the not-so-distant future.
- Can I still track the linkage between data and model without using dvc run and a graph of tasks? Basically, I would like an extremely minimal DVC intrusion into my Git repo for an existing machine learning application.
There are two options:
- Use dvc add to track models and/or input datasets. It should be enough if you use git commit on the DVC files produced by dvc add. This is the very minimum you can get with DVC, and it does not require using dvc run. Check the first part (up to the Pipelines/Add transformations section) of the DVC Get Started.
- You could use --no-exec in dvc run and then just dvc commit and git commit the results. That way you’ll get your DVC files with all the linkages, without having to actually run your commands through DVC.
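The second option might look like this (a sketch; train.py, data.csv, and model.pkl are hypothetical names):

```shell
# Record the stage and its linkages without executing the command
dvc run --no-exec -f train.dvc \
        -d data.csv -d train.py \
        -o model.pkl \
        "python train.py"

# Save checksums for the outputs that already exist on disk
dvc commit train.dvc

# Version the resulting stage file with Git
git add train.dvc
git commit -m "Add training stage metadata"
```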