DVC Heartbeat - Discord gems

DVC Discord gems

There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.

We will be sifting through the issues and discussions and sharing the most interesting takeaways.

There is no separate guide for that, but it is very straightforward. See the DVC file format description for how a DVC file looks inside in general. All dvc add or dvc run does is compute the md5 fields in it, that is all. You could write your DVC file by hand and then run dvc repro, which will run the command (if any) and compute all the needed checksums … read more
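
For instance, a hand-written stage file could look something like this (the file and script names are purely hypothetical); dvc repro then runs the command and fills in the missing md5 fields:

$ cat prepare.dvc        # written by hand, no checksums yet
cmd: python prepare.py
deps:
- path: raw.csv
outs:
- path: clean.csv
$ dvc repro prepare.dvc  # runs the command and computes all the checksums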

…There’s a ton of code in that project, and it’s very non-trivial to define the code dependencies for my training stage — there are a lot of imports going on, the training code is distributed across many modules … read more

DVC officially only supports regular Azure Blob Storage. Gen1 Data Lake should be accessible through the same interface, so configuring a regular Azure remote for DVC should work. It seems like Gen2 Data Lake has the Blob API disabled. If you know more details about the difference between Gen1 and Gen2, feel free to join our community and share this knowledge.

Apache 2.0. One of the most common and permissive OSS licenses.

$ dvc remote add upstream s3://my-bucket
$ dvc remote modify upstream region REGION_NAME
$ dvc remote modify upstream endpointurl <url>

Find and click the S3 API compatible storage on this page

… it adds your data files there (the ones tracked by DVC), so that you don’t accidentally add them to Git as well. You can open it with a text editor of your liking and see your data files listed there.

… with DVC, you could connect your data sources on HDFS with the pipeline in your local project by simply specifying them as external dependencies. For example, let’s say your script process.cmd works on an input file on HDFS and then downloads a result to your local workspace. With DVC it could look something like:

$ dvc run -d hdfs://example.com/home/shared/input -d process.cmd -o output process.cmd

read more.

If you have any questions, concerns or ideas, let us know!


Discord gems from the April Heartbeat:

  • It supports Windows, Mac, Linux. Python 2 and 3.

  • No specific CPU or RAM requirements — it’s a lightweight command line tool and should be able to run pretty much everywhere you can run Python.

  • It depends on a few Python libraries that it installs as dependencies (they are specified in the requirements.txt).

  • It does not depend on Git and theoretically could be run without any SCM. Running it on top of a Git repository, however, is recommended and gives you the ability to actually save the history of datasets, models, etc. (even though it does not put them into Git directly).

No server licenses for DVC. It is 100% free and open source.

There is no limit. None enforced by DVC itself. It depends on the size of your local or remote storage. You need to have some space available on S3, your SSH server, or whatever other storage you are using to keep the data files, models, and the versions of them you would like to store.

DVC figures out the pipeline by looking at the dependencies and outputs of the stages. For example, having the following:

$ dvc run -f download.dvc \
          -o joke.txt \
          "curl https://geek-jokes.sameerkumar.website/api > joke.txt"
$ dvc run -f duplicate.dvc \
          -d joke.txt \
          -o duplicate.txt \
          "cat joke.txt joke.txt > duplicate.txt"

you will end up with two stages: download.dvc and duplicate.dvc. The download one will have joke.txt as an output. The duplicate one defines joke.txt as a dependency, since it is the same file. DVC detects that and creates a pipeline by joining those stages.

You can inspect the content of each stage file here (they are human readable).

Yes! It’s a frequent scenario for multiple repos to share remotes and even a local cache. A DVC file serves as a link to the actual data. If you add the same DVC file (e.g. data.dvc) to the new repo and do dvc pull -r remotename data.dvc, it will fetch the data. You have to use dvc remote add first to specify the coordinates of the remote storage you would like to share in every project. Alternatively (check out the question below), you could use --global to specify a single default remote (and/or cache dir) per machine.
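
As a rough sketch (the remote name and bucket are just examples reused from above), in the second repository you would do something like:

$ dvc remote add upstream s3://my-bucket   # same storage the first repo pushes to
$ dvc pull -r upstream data.dvc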

Use --global when you specify the remote settings; then the remote will be visible for all projects on the same machine. --global saves the remote configuration to the global config (e.g. ~/.config/dvc/config) instead of the per project one (.dvc/config). See more details here.
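
For example (the remote name and URL are placeholders):

$ dvc remote add --global myremote s3://my-bucket
$ dvc config --global core.remote myremote   # make it the default remote for every project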

We would recommend skimming through our Get Started tutorial. To summarize the data versioning process of DVC:

  • You create stage (aka DVC) files by adding or importing files (dvc add / dvc import), or by running a command to generate files (dvc run -o file.csv "wget https://example.com/file.csv").

  • These stage files are tracked by Git.

  • You use Git to retrieve previous stage files (e.g. git checkout v1.0).

  • Then use dvc checkout to retrieve all the files referenced by those stage files.

All your files (with each of their different versions) are stored in the .dvc/cache directory, which you sync with remote storage (let’s say an S3 bucket) using the dvc push and dvc pull commands (analogous to git push / git pull, but instead of syncing your .git, you are syncing your .dvc cache).
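
A sketch of that loop (the tag name is hypothetical):

$ git checkout v1.0   # get the stage files for that version
$ dvc checkout        # get the matching data files from the cache
$ dvc push            # upload the cache to remote storage
$ dvc pull            # download it again, e.g. on another machine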

If you need to move your dvc file somewhere, it is pretty easy, even if done manually:

$ mv my.dvc data/my.dvc
# and now open my.dvc with your favorite editor and change wdir in it to 'wdir: ../'.

This is expected behaviour. DVC saves files under names derived from their checksums in order to prevent duplication. If you delete a “pushed” file in your project directory and perform dvc pull, DVC will take care of pulling the file and renaming it back to its “original” name.

Below are some details about how DVC’s cache works, just to illustrate the logic. When you add a data source:

$ echo "foo" > data.txt
$ dvc add data.txt

It computes the (md5) checksum of the file and generates a DVC file with related information:


md5: 3bccbf004063977442029334c3448687
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: data.txt
wdir: ..

The original file is moved to the cache and a link or copy (depending on your filesystem) is created to replace it in your workspace:

.dvc/cache
└── d3
    └── b07384d113edec49eaa6238ad5ff00
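
You can see the mapping yourself: the first two characters of the checksum become the directory name and the rest becomes the file name (output shown for the "foo" example above; use md5 instead of md5sum on macOS):

$ md5sum data.txt
d3b07384d113edec49eaa6238ad5ff00  data.txt
$ cat .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00
foo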

Absolutely! There are three ways you could interact with DVC:

  1. Use subprocess to launch DVC;

  2. Use from dvc.main import main and use it with regular CLI logic, like ret = main(['add', 'foo']);

  3. Use our internal API (see dvc/repo and dvc/command in our source to get a grasp of it). It is not officially public yet, and we don’t have any special docs for it, but it is fairly stable and could definitely be used for a POC. We’ll add docs and all the official stuff for it in the not-so-distant future.

There are two options:

  1. Use dvc add to track models and/or input datasets. It should be enough if you use git commit on the DVC files produced by dvc add. This is the very minimum you can get with DVC and it does not require using dvc run. Check the first part (up to the Pipelines/Add transformations section) of the DVC Get Started.

  2. You could use --no-exec in dvc run and then just dvc commit and git commit the results (see the sketch below). That way you’ll get your DVC files with all the linkages, without having to actually run your commands through DVC.
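
A sketch of the second option (file names are hypothetical):

$ dvc run --no-exec -d data.csv -o model.pkl "python train.py"
$ dvc commit                      # save checksums for outputs that already exist
$ git add model.pkl.dvc
$ git commit -m "Track existing model with DVC"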


Discord gems from the May Heartbeat

We feared that too until we met them in person. They appeared to be real (unless bots also love Ramen now)!

Every time you run dvc add to start tracking some data artifact, its path is automatically added to the .gitignore file. As a result, it is hard to commit it to Git by mistake — you would need to explicitly modify the .gitignore first. The feature to track some external data is called external outputs (if all you need is to track some data artifacts). Usually it is used when you have some data on S3 or SSH and don’t want to pull it into your workspace, but it works even when your data is located on the same machine outside of the repository.

Use dvc import to track and download the remote data the first time, and then again via dvc repro if the data has changed remotely. If you don’t want to track remote changes (i.e. lock the data after it has been downloaded), use dvc run with a dummy dependency (any text file that you do not touch will do) that runs an actual wget/curl to get the data.
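
A sketch of the second approach (file names and URL are hypothetical):

$ echo "update 1" > data.trigger   # dummy dependency, only ever edited by hand
$ dvc run -d data.trigger -o data.csv "wget -O data.csv https://example.com/data.csv"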

Almost any command in DVC that deals with pipelines (sets of DVC-files) accepts a single stage as a target, for example dvc pipeline show --ascii model.dvc.

It’s a well known problem with NFS and CIFS (Azure) — they do not support file locks properly, which is required by the SQLite engine to operate. The easiest workaround is not to create a DVC project on a network attached partition. In certain cases a fix can be made by changing the mounting options; check this discussion for the Azure ML Service.

An excellent question! The short answer is:

  • dvc cache dir --local — to move your data to a big partition;

  • dvc config cache.type reflink,hardlink,symlink,copy — to enable file links and avoid actual copying;

  • dvc config cache.protected true — it’s highly recommended to make links in your workspace read-only to avoid corrupting the cache.

To add your data to the DVC cache for the first time, clone the repository on the big partition and run dvc add to add your data. Then you can do git pull and dvc pull on the small partition and DVC will create all the necessary links.
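
Putting those settings together (the partition path is just an example):

$ dvc cache dir /mnt/bigdisk/dvc-cache --local
$ dvc config cache.type reflink,hardlink,symlink,copy
$ dvc config cache.protected true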

Usually it means that a parent directory of one of the arguments for dvc add / dvc run is already tracked. For example, you’ve already added the whole datasets directory, and now you are trying to add a subdirectory, which is already tracked as a part of the datasets one. There is no need to do that; you could dvc add datasets or dvc repro datasets.dvc to save the changes.

Check the locale settings you have (the locale command in Linux). Python expects a locale that can handle unicode printing. Usually it’s solved with these commands: export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8. You can place those exports into .bashrc or another file that defines your environment.

In short — yes, but it can also be configured. By default, DVC is going to use either your default profile (from ~/.aws/*) or your env vars. If you need more flexibility (e.g. you need to use different credentials for different projects), check out this guide to configure custom AWS profiles, and then you can use them with DVC via these remote options.

You can use the metrics show command’s XPath option and provide multiple attribute names to it:

$ dvc metrics add model.metrics --type json --xpath AUC_RATIO[train,valid]
metrics.json:
             0.89227482588
             0.856160272625

There are a few options to add a new dependency:

The only recommended way so far would be to somehow make DVC aware of your package’s version. One way to do that is to create a separate stage that dynamically prints the version of that specific package into a file, which your stage would then depend on:

$ dvc run -o mypkgver 'pip show mypkg > mypkgver'
$ dvc run -d mypkgver -d ... -o .. mycmd

Yes, you could use dvc commit -f. It will save all the current checksums without re-running your commands.

Yes! These DVC features are called external outputs and external dependencies. You can use one of them or both to track, process, and version your data on cloud storage without downloading it locally.

Discord gems from the June Heartbeat

Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. Give it a try and let us know if you hit any problems here.

It’s a wide topic. The actual solution might depend on a specific scenario and what exactly needs to be versioned. DVC does not provide any special functionality on top of databases to version their content.

Depending on your use case, our recommendation would be to run SQL and pull the result file (CSV/TSV file?) that then can be used to do analysis. This file can be taken under DVC control. Alternatively, in certain cases source files (that are used to populate the databases) can be taken under control and we can keep versions of them, or track incoming updates.

Read the discussion to learn more.

DVC just saves every file as is; we don’t use binary diffs right now. There won’t be full directory duplication though (if you added just a few files to a directory with 10M files), since we treat every file inside as a separate entity.

The simplest option is to create a config file — json or whatnot — that your scripts would read and your stages depend on.
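
For example (file and script names are hypothetical), the config file simply becomes another dependency of the stage:

$ dvc run -d config.json -d train.py -o model.pkl "python train.py --config config.json"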

There is a way to do that pretty easily through our (still not officially released) API. Here is an example script showing how it could be done.

  • Docker and DVC. To be able to push/pull data we need to run git clone to get the DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).

You can do git clone --depth 1, which will not download any history except the latest commits.

If you are pushing the same file, there are no copies pushed or saved in the cache. DVC uses checksums to identify files, so if you add the same file once again, it will detect that the cache for it is already present locally and won’t copy it again. Same with dvc push: if it sees that you already have a cache file with that checksum on your remote, it won’t upload it again.

Something like this should work:

$ which dvc
/usr/local/bin/dvc

$ ls -la /usr/local/bin/dvc
/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc

$ sudo rm -f /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc

Just add the public URL of the bucket as an HTTP endpoint. See here for an example: https://remote.dvc.org/get-started is made to redirect to the S3 bucket anyone can read from.
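
For instance, reusing that URL (the remote name is arbitrary):

$ dvc remote add -d storage https://remote.dvc.org/get-started
$ dvc pull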

Most likely it happens due to an attempt to run DVC on NFS that has some configuration problems. There is a well known problem with DVC on NFS — sometimes it hangs on trying to lock a file. The usual workaround for this problem is to allocate the DVC cache on NFS, but run the project itself (git clone, DVC metafiles, etc.) on the local file system. Read this answer to see how it can be set up.
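
A minimal sketch of that layout (all paths are hypothetical):

$ git clone <your-repo-url> ~/projects/myrepo   # project on the local file system
$ cd ~/projects/myrepo
$ dvc cache dir /mnt/nfs/dvc-cache              # only the cache lives on NFS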
