DVC Heartbeat - Discord gems

Discord gems from the June Heartbeat

Azure Data Lake is HDFS compatible, and DVC supports HDFS remotes. Give it a try and let us know if you hit any problems here.

It’s a wide topic. The actual solution might depend on a specific scenario and what exactly needs to be versioned. DVC does not provide any special functionality on top of databases to version their content.

Depending on your use case, our recommendation would be to run the SQL query and export the result to a file (CSV or TSV, for example) that can then be used for analysis. This file can be put under DVC control. Alternatively, in certain cases the source files used to populate the database can be put under DVC control, so we can keep versions of them or track incoming updates.

Read the discussion to learn more.

DVC just saves every file as is; we don’t use binary diffs right now. However, if you add just a few files to a directory that holds 10M files, the whole directory won’t be duplicated, since we treat every file inside as a separate entity.

The simplest option is to create a config file (JSON or whatnot) that your scripts read and your stages depend on.
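As a rough sketch of that idea, the commands below write a small JSON config into a scratch directory and read one value back from it. The file name config.json, its keys, and the train.py and model.pkl names in the final comment are all made up for illustration, and python3 is assumed to be available as a JSON parser.

```shell
# Scratch directory so nothing in a real project is touched.
workdir=$(mktemp -d)

# Hypothetical parameters file; the keys are invented for this example.
cat > "$workdir/config.json" <<'EOF'
{
  "learning_rate": 0.01,
  "epochs": 20
}
EOF

# A script reads its settings from the file instead of hard-coding them.
epochs=$(python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["epochs"])' "$workdir/config.json")
echo "training for $epochs epochs"

# Listing config.json as a stage dependency (-d) lets DVC notice when it changes.
# The line below uses `dvc run` as found in DVC releases from the time of this
# post; it stays commented out because train.py and model.pkl are placeholders.
# dvc run -d config.json -d train.py -o model.pkl python train.py
```

Because the config file is an ordinary dependency, editing a value in it is enough for the stage to be treated as changed on the next reproduction.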

There is a way to do that through our (still not officially released) API pretty easily. Here is an example script showing how it could be done.

  • Docker and DVC. To be able to push/pull data we need to run a git clone to get DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).

You can do git clone --depth 1, which will not download any history except the latest commit.
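To see the effect of a shallow clone without touching a real project, here is a small self-contained sketch. It builds a throwaway repository with three commits and clones it with a depth of 1. The repository layout, the data.dvc file name, and the demo identity are placeholders invented for the example.

```shell
# Build a throwaway repository with three commits (stand-in for a real project).
workdir=$(mktemp -d)
git init -q "$workdir/origin"
for i in 1 2 3; do
    echo "revision $i" > "$workdir/origin/data.dvc"
    git -C "$workdir/origin" add data.dvc
    git -C "$workdir/origin" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $i"
done

# git ignores --depth for plain local paths, so clone through a file:// URL.
git clone -q --depth 1 "file://$workdir/origin" "$workdir/shallow"

# Compare how much history each copy carries.
full_count=$(git -C "$workdir/origin" rev-list --count HEAD)
shallow_count=$(git -C "$workdir/shallow" rev-list --count HEAD)
echo "origin: $full_count commits, shallow clone: $shallow_count commit(s)"
```

The shallow copy still contains every file from the latest commit, including DVC-files and remote definitions, which is all that is needed to push or pull data from inside the container.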

If you are pushing the same file, no copies are pushed or saved in the cache. DVC uses checksums to identify files, so if you add the same file again, it detects that the file is already in the local cache and won’t copy it there again. The same goes for dvc push: if it sees that a cache file with that checksum already exists on your remote, it won’t upload it again.
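The idea of identifying files by checksum can be imitated in a few lines of shell. This is only a toy illustration of content-addressed storage, not DVC's actual code or cache layout: the store function, the directory names, and the sample files are invented here, MD5 is chosen just as a convenient checksum, and md5sum from GNU coreutils is assumed to be installed.

```shell
cache=$(mktemp -d)     # stand-in for a cache directory, not DVC's real layout
workdir=$(mktemp -d)

# store FILE: copy FILE into the cache under its MD5 checksum, unless an entry
# with that checksum already exists.
store() {
    sum=$(md5sum "$1" | cut -d ' ' -f 1)
    if [ -e "$cache/$sum" ]; then
        echo "skip   $1 (checksum $sum already cached)"
    else
        cp "$1" "$cache/$sum"
        echo "stored $1 as $sum"
    fi
}

printf 'id,label\n1,cat\n2,dog\n' > "$workdir/data.csv"
cp "$workdir/data.csv" "$workdir/data_copy.csv"          # same content, new name
printf 'id,label\n1,cat\n2,bird\n' > "$workdir/data_changed.csv"

store "$workdir/data.csv"
store "$workdir/data_copy.csv"       # skipped: identical content
store "$workdir/data_changed.csv"    # stored: content differs

cached_entries=$(ls "$cache" | wc -l)
echo "cache holds $cached_entries entries"
```

Three files go in, but only two entries end up in the toy cache, because the file name plays no role in the lookup and only the content does.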

Something like this should work:

$ which dvc
/usr/local/bin/dvc

$ ls -la /usr/local/bin/dvc
/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc

$ sudo rm -f /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc

Just add the public URL of the bucket as an HTTP endpoint. See here for an example. https://remote.dvc.org/get-started is set up to redirect to an S3 bucket that anyone can read from.

Most likely it happens because DVC is running on an NFS mount that has some configuration problems. There is a well-known problem with DVC on NFS: sometimes it hangs while trying to lock a file. The usual workaround is to keep the DVC cache on NFS but run the project (git clone, DVC metafiles, etc.) on the local file system. Read this answer to see how it can be set up.
