DVC Heartbeat - Discord gems

DVC Discord gems

There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.

We will be sifting through the issues and discussions and sharing the most interesting takeaways.

There is no separate guide for that, but it is very straightforward. See the DVC file format description for how a DVC file looks inside in general. All dvc add or dvc run does is compute the md5 fields in it, that is all. You could write your DVC file by hand and then run dvc repro, which will run the command (if any) and compute all the needed checksums … read more
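
For instance, a hand-written stage file could look something like this (the file and script names are purely hypothetical); dvc repro then runs the command and fills in the missing md5 fields:

$ cat prepare.dvc        # written by hand, no checksums yet
cmd: python prepare.py
deps:
- path: raw.csv
outs:
- path: clean.csv
$ dvc repro prepare.dvc  # runs the command and computes all the checksums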

…There’s a ton of code in that project, and it’s very non-trivial to define the code dependencies for my training stage — there are a lot of imports going on, the training code is distributed across many modules … read more

DVC officially only supports regular Azure Blob Storage. Gen1 Data Lake should be accessible through the same interface, so configuring a regular Azure remote for DVC should work. It seems like Gen2 Data Lake has the Blob API disabled. If you know more details about the difference between Gen1 and Gen2, feel free to join our community and share this knowledge.

Apache 2.0. One of the most common and permissive OSS licenses.

$ dvc remote add upstream s3://my-bucket
$ dvc remote modify upstream region REGION_NAME
$ dvc remote modify upstream endpointurl <url>

Find and click the S3 API compatible storage on this page

… it adds your data files there (the ones tracked by DVC), so that you don’t accidentally add them to Git as well. You can open it with a text editor of your liking and see your data files listed there.

… with DVC, you could connect your data sources on HDFS with the pipeline in your local project by simply specifying them as external dependencies. For example, let’s say your script process.cmd works on an input file on HDFS and then downloads a result to your local workspace. With DVC it could look something like:

$ dvc run -d hdfs://example.com/home/shared/input -d process.cmd -o output process.cmd

read more.

If you have any questions, concerns or ideas, let us know!


Discord gems from the April Heartbeat:

  • It supports Windows, Mac, Linux. Python 2 and 3.

  • No specific CPU or RAM requirements — it’s a lightweight command line tool and should be able to run pretty much everywhere you can run Python.

  • It depends on a few Python libraries that it installs as dependencies (they are specified in the requirements.txt).

  • It does not depend on Git and theoretically could be run without any SCM. Running it on top of a Git repository, however, is recommended and gives you the ability to actually save the history of datasets, models, etc. (even though it does not put them into Git directly).

No server licenses for DVC. It is 100% free and open source.

There is no limit. None enforced by DVC itself. It depends on the size of your local or remote storage. You need to have some space available on S3, your SSH server, or whatever other storage you are using to keep the data files, models, and the versions of them you would like to store.

DVC figures out the pipeline by looking at the dependencies and outputs of the stages. For example, having the following:

$ dvc run -f download.dvc \
          -o joke.txt \
          "curl https://geek-jokes.sameerkumar.website/api > joke.txt"
$ dvc run -f duplicate.dvc \
          -d joke.txt \
          -o duplicate.txt \
          "cat joke.txt joke.txt > duplicate.txt"

you will end up with two stages: download.dvc and duplicate.dvc. The download one will have joke.txt as an output. The duplicate one defines joke.txt as a dependency, since it is the same file. DVC detects that and creates a pipeline by joining those stages.

You can inspect the content of each stage file here (they are human readable).

Yes! It’s a frequent scenario for multiple repos to share remotes and even a local cache. A DVC file serves as a link to the actual data. If you add the same DVC file (e.g. data.dvc) to the new repo and do dvc pull -r remotename data.dvc, it will fetch the data. You have to use dvc remote add first to specify the coordinates of the remote storage you would like to share in every project. Alternatively (check out the question below), you could use --global to specify a single default remote (and/or cache dir) per machine.
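
As a rough sketch (the remote name and bucket are just examples reused from above), in the second repository you would do something like:

$ dvc remote add upstream s3://my-bucket   # same storage the first repo pushes to
$ dvc pull -r upstream data.dvc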

Use --global when you specify the remote settings; then the remote will be visible for all projects on the same machine. --global saves the remote configuration to the global config (e.g. ~/.config/dvc/config) instead of the per project one (.dvc/config). See more details here.
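
For example (the remote name and URL are placeholders):

$ dvc remote add --global myremote s3://my-bucket
$ dvc config --global core.remote myremote   # make it the default remote for every project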

We would recommend skimming through our Get Started tutorial. To summarize the data versioning process of DVC:

  • You create stage (aka DVC) files by adding or importing files (dvc add / dvc import), or by running a command to generate files (dvc run -o file.csv "wget https://example.com/file.csv").

  • These stage files are tracked by Git.

  • You use Git to retrieve previous stage files (e.g. git checkout v1.0).

  • Then use dvc checkout to retrieve all the files referenced by those stage files.

All your files (with each of their different versions) are stored in the .dvc/cache directory, which you sync with remote storage (let’s say an S3 bucket) using the dvc push and dvc pull commands (analogous to git push / git pull, but instead of syncing your .git, you are syncing your .dvc cache).
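
A sketch of that loop (the tag name is hypothetical):

$ git checkout v1.0   # get the stage files for that version
$ dvc checkout        # get the matching data files from the cache
$ dvc push            # upload the cache to remote storage
$ dvc pull            # download it again, e.g. on another machine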

If you need to move your dvc file somewhere, it is pretty easy, even if done manually:

$ mv my.dvc data/my.dvc
# and now open my.dvc with your favorite editor and change wdir in it to 'wdir: ../'.

This is expected behaviour. DVC saves files under names derived from their checksums in order to prevent duplication. If you delete a “pushed” file in your project directory and perform dvc pull, DVC will take care of pulling the file and renaming it back to its “original” name.

Below are some details about how DVC’s cache works, just to illustrate the logic. When you add a data source:

$ echo "foo" > data.txt
$ dvc add data.txt

It computes the (md5) checksum of the file and generates a DVC file with related information:


md5: 3bccbf004063977442029334c3448687
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: data.txt
wdir: ..

The original file is moved to the cache and a link or copy (depending on your filesystem) is created to replace it in your workspace:

.dvc/cache
└── d3
    └── b07384d113edec49eaa6238ad5ff00
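
You can see the mapping yourself: the first two characters of the checksum become the directory name and the rest becomes the file name (output shown for the "foo" example above; use md5 instead of md5sum on macOS):

$ md5sum data.txt
d3b07384d113edec49eaa6238ad5ff00  data.txt
$ cat .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00
foo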

Absolutely! There are three ways you could interact with DVC:

  1. Use subprocess to launch DVC;

  2. Use from dvc.main import main and use it with regular CLI logic, like ret = main(['add', 'foo']);

  3. Use our internal API (see dvc/repo and dvc/command in our source to get a grasp of it). It is not officially public yet, and we don’t have any special docs for it, but it is fairly stable and could definitely be used for a POC. We’ll add docs and all the official stuff for it in the not-so-distant future.

There are two options:

  1. Use dvc add to track models and/or input datasets. It should be enough if you use git commit on the DVC files produced by dvc add. This is the very minimum you can get with DVC and it does not require using dvc run. Check the first part (up to the Pipelines/Add transformations section) of the DVC Get Started.

  2. You could use --no-exec in dvc run and then just dvc commit and git commit the results (see the sketch below). That way you’ll get your DVC files with all the linkages, without having to actually run your commands through DVC.
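
A sketch of the second option (file names are hypothetical):

$ dvc run --no-exec -d data.csv -o model.pkl "python train.py"
$ dvc commit                      # save checksums for outputs that already exist
$ git add model.pkl.dvc
$ git commit -m "Track existing model with DVC"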


Discord gems from the May Heartbeat

We feared that too until we met them in person. They appeared to be real (unless bots also love Ramen now)!

Every time you run dvc add to start tracking some data artifact, its path is automatically added to the .gitignore file. As a result, it is hard to commit it to Git by mistake — you would need to explicitly modify the .gitignore first. The feature to track some external data is called external outputs (if all you need is to track some data artifacts). Usually it is used when you have some data on S3 or SSH and don’t want to pull it into your workspace, but it works even when your data is located on the same machine outside of the repository.

Use dvc import to track and download the remote data the first time, and then again via dvc repro if the data has changed remotely. If you don’t want to track remote changes (i.e. lock the data after it has been downloaded), use dvc run with a dummy dependency (any text file that you do not touch will do) that runs an actual wget/curl to get the data.
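
A sketch of the second approach (file names and URL are hypothetical):

$ echo "update 1" > data.trigger   # dummy dependency, only ever edited by hand
$ dvc run -d data.trigger -o data.csv "wget -O data.csv https://example.com/data.csv"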

Almost any command in DVC that deals with pipelines (sets of DVC-files) accepts a single stage as a target, for example dvc pipeline show --ascii model.dvc.

It’s a well known problem with NFS and CIFS (Azure) — they do not support file locks properly, which is required by the SQLite engine to operate. The easiest workaround is not to create a DVC project on a network attached partition. In certain cases a fix can be made by changing the mounting options; check this discussion for the Azure ML Service.

An excellent question! The short answer is:

  • dvc cache dir --local — to move your data to a big partition;

  • dvc config cache.type reflink,hardlink,symlink,copy — to enable file links and avoid actual copying;

  • dvc config cache.protected true — it’s highly recommended to make links in your workspace read-only to avoid corrupting the cache.

To add your data to the DVC cache for the first time, clone the repository on the big partition and run dvc add to add your data. Then you can do git pull and dvc pull on the small partition and DVC will create all the necessary links.
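
Putting those settings together (the partition path is just an example):

$ dvc cache dir /mnt/bigdisk/dvc-cache --local
$ dvc config cache.type reflink,hardlink,symlink,copy
$ dvc config cache.protected true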

Usually it means that a parent directory of one of the arguments for dvc add / dvc run is already tracked. For example, you’ve already added the whole datasets directory, and now you are trying to add a subdirectory, which is already tracked as a part of the datasets one. There is no need to do that; you could dvc add datasets or dvc repro datasets.dvc to save the changes.

Check the locale settings you have (the locale command in Linux). Python expects a locale that can handle unicode printing. Usually it’s solved with these commands: export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8. You can place those exports into .bashrc or another file that defines your environment.

In short — yes, but it can also be configured. By default, DVC is going to use either your default profile (from ~/.aws/*) or your env vars. If you need more flexibility (e.g. you need to use different credentials for different projects), check out this guide to configure custom AWS profiles, and then you can use them with DVC via these remote options.

You can use the metrics show command’s XPath option and provide multiple attribute names to it:

$ dvc metrics add model.metrics --type json --xpath AUC_RATIO[train,valid]
metrics.json:
             0.89227482588
             0.856160272625

There are a few options to add a new dependency:

The only recommended way so far would be to somehow make DVC aware of your package’s version. One way to do that is to create a separate stage that dynamically prints the version of that specific package into a file, which your stage would then depend on:

$ dvc run -o mypkgver 'pip show mypkg > mypkgver'
$ dvc run -d mypkgver -d ... -o .. mycmd

Yes, you could use dvc commit -f. It will save all the current checksums without re-running your commands.

Yes! These DVC features are called external outputs and external dependencies. You can use one of them or both to track, process, and version your data on cloud storage without downloading it locally.

Discord gems from the June Heartbeat

Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. Give it a try and let us know if you hit any problems here.

It’s a wide topic. The actual solution might depend on a specific scenario and what exactly needs to be versioned. DVC does not provide any special functionality on top of databases to version their content.

Depending on your use case, our recommendation would be to run SQL and pull the result file (CSV/TSV file?) that then can be used to do analysis. This file can be taken under DVC control. Alternatively, in certain cases source files (that are used to populate the databases) can be taken under control and we can keep versions of them, or track incoming updates.

Read the discussion to learn more.

DVC just saves every file as is; we don’t use binary diffs right now. There won’t be full directory duplication though (if you added just a few files to a directory with 10M files), since we treat every file inside as a separate entity.

The simplest option is to create a config file — json or whatnot — that your scripts would read and your stages depend on.
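
For example (file and script names are hypothetical), the config file simply becomes another dependency of the stage:

$ dvc run -d config.json -d train.py -o model.pkl "python train.py --config config.json"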

There is a way to do that pretty easily through our (still not officially released) API. Here is an example script showing how it could be done.

  • Docker and DVC. To be able to push/pull data we need to run git clone to get the DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).

You can do git clone --depth 1, which will not download any history except the latest commits.

If you are pushing the same file, there are no copies pushed or saved in the cache. DVC uses checksums to identify files, so if you add the same file once again, it will detect that the cache for it is already present locally and won’t copy it again. Same with dvc push: if it sees that you already have a cache file with that checksum on your remote, it won’t upload it again.

Something like this should work:

$ which dvc
/usr/local/bin/dvc

$ ls -la /usr/local/bin/dvc
/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc

$ sudo rm -f /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc

Just add the public URL of the bucket as an HTTP endpoint. See here for an example: https://remote.dvc.org/get-started is made to redirect to the S3 bucket anyone can read from.
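
For instance, reusing that URL (the remote name is arbitrary):

$ dvc remote add -d storage https://remote.dvc.org/get-started
$ dvc pull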

Most likely it happens due to an attempt to run DVC on NFS that has some configuration problems. There is a well known problem with DVC on NFS — sometimes it hangs on trying to lock a file. The usual workaround for this problem is to allocate the DVC cache on NFS, but run the project itself (git clone, DVC metafiles, etc.) on the local file system. Read this answer to see how it can be set up.
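
A minimal sketch of that layout (all paths are hypothetical):

$ git clone <your-repo-url> ~/projects/myrepo   # project on the local file system
$ cd ~/projects/myrepo
$ dvc cache dir /mnt/nfs/dvc-cache              # only the cache lives on NFS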
