Discord gems from the June Heartbeat
Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. Give it a try and let us know here if you hit any problems.
- An excellent discussion on versioning tabular (SQL) data. Do you know of any tools that deal better with SQL-specific versioning?
It’s a wide topic, and the right solution depends on the specific scenario and on what exactly needs to be versioned. DVC does not provide any special functionality on top of databases to version their content.
Depending on your use case, our recommendation would be to run the SQL query and take the resulting file (a CSV/TSV file, for example) under DVC control; that file can then be used for analysis. Alternatively, in certain cases the source files that are used to populate the database can be taken under control instead, so you can keep versions of them or track incoming updates.
Read the discussion to learn more.
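As a concrete sketch of that workflow (the database, table, and file names below are made up for illustration), a script could materialize a query result as a CSV file, which is then put under DVC control:

```python
# Illustrative sketch: dump a SQL query result to a CSV file that can
# then be tracked with `dvc add`. Uses sqlite3 so the example is
# self-contained; any database driver would work the same way.
import csv
import sqlite3

def export_query_to_csv(db_path, query, out_path):
    """Run `query` against the database and write the result to CSV."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute(query)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
    conn.close()
    return out_path

# Afterwards the result file goes under DVC control:
#   $ dvc add training_data.csv
```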
- How does DVC do the versioning of binary files? Is there a binary diff, similar to Git? Or is every version stored distinctly in full?
DVC saves every file as is; we don’t use binary diffs right now. There won’t be full directory duplication, though (e.g. if you added just a few files to a directory with 10M files), since we treat every file inside as a separate entity.
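To illustrate why identical content is stored only once, here is a minimal sketch of the content-addressable scheme DVC’s cache is based on (DVC uses MD5 checksums and a similar directory layout, but this code is illustrative, not DVC’s actual implementation):

```python
# Minimal sketch of a content-addressable cache: a file is stored under
# a path derived from its MD5 checksum, so identical content (from any
# number of files or versions) occupies cache space exactly once.
import hashlib
import os
import shutil

def cache_file(path, cache_dir):
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    # DVC-style layout: the first two hex chars form a subdirectory.
    dest = os.path.join(cache_dir, md5[:2], md5[2:])
    if not os.path.exists(dest):  # already cached -> nothing to copy
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy(path, dest)
    return dest
```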
The simplest option is to create a config file (JSON or whatever format you prefer) that your scripts read and your stages depend on.
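For instance (the file and stage names here are hypothetical), a pipeline script could read its parameters from `config.json`, and the stage would list that file as a dependency:

```python
# Hypothetical example of a pipeline script reading a JSON config file
# that its DVC stage declares as a dependency.
import json

def load_config(path="config.json"):
    with open(path) as f:
        return json.load(f)

# The stage is then created with the config as a dependency, e.g.:
#   $ dvc run -d config.json -d train.py -o model.pkl python train.py
```

With that setup, changing a value in `config.json` invalidates the stage, and `dvc repro` will re-run it.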
- What is the best way to get cached output files from different branches simultaneously? For example, cached tensorboard files from different branches to compare experiments.
There is a way to do that pretty easily through our (still not officially released) API. Here is an example script showing how it could be done.
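The gist of that script, simplified here, is that a DVC-file stores its output’s checksum, from which the cache path can be derived; `git show <branch>:<file>.dvc` yields the metafile for any branch, so outputs from several branches can be located in the cache at once. (Real DVC-files are YAML; the regex parsing below is just for illustration.)

```python
# Illustrative sketch: derive the cache path of an output from the
# checksum recorded in its DVC metafile.
import os
import re

def cache_path_from_dvcfile(dvcfile_text, cache_dir=".dvc/cache"):
    # A DVC metafile records the output's md5 checksum; the cache path
    # is the checksum split as <first 2 chars>/<remaining 30 chars>.
    m = re.search(r"md5:\s*([0-9a-f]{32})", dvcfile_text)
    checksum = m.group(1)
    return os.path.join(cache_dir, checksum[:2], checksum[2:])
```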
- Docker and DVC. To be able to push/pull data we need to run a git clone to get the DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).
You can do `git clone --depth 1`, which will not download any history except the latest commit.
If you are pushing the same file, no copies are pushed or saved in the cache. DVC uses checksums to identify files, so if you add the same file once again, it will detect that the cache for it already exists locally and won’t copy it again. The same goes for `dvc push`: if it sees that a cache file with that checksum already exists on your remote, it won’t upload it again.
Something like this should work:
```
$ which dvc
/usr/local/bin/dvc
$ ls -la /usr/local/bin/dvc
/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc
$ sudo rm -f /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc
```
Just add the public URL of the bucket as an HTTP endpoint. See here for an example: https://remote.dvc.org/get-started is made to redirect to an S3 bucket anyone can read from.
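For reference, the resulting remote section in `.dvc/config` would look something like this (the remote name is arbitrary):

```ini
['remote "public-data"']
url = https://remote.dvc.org/get-started
```

The same section can be created with `dvc remote add public-data https://remote.dvc.org/get-started`.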
- I’m getting the same error over and over about locking:
ERROR: failed to lock before running a command: cannot perform the cmd since DVC is busy and locked. Please retry the command later.
Most likely it happens due to an attempt to run DVC on an NFS mount that has some configuration problems. There is a well-known problem with DVC on NFS: sometimes it hangs trying to lock a file. The usual workaround is to allocate the DVC cache on NFS, but keep the project itself (the Git clone, DVC metafiles, etc.) on the local file system. Read this answer to see how it can be set up.
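Concretely, the setup could look like this in `.dvc/config` (the NFS mount point below is illustrative):

```ini
[cache]
dir = /mnt/nfs-storage/dvc-cache
```

This can also be set with `dvc cache dir /mnt/nfs-storage/dvc-cache`, while the repository itself (Git history and DVC-files) stays on the local disk.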