Discord gems from the June Heartbeat
Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. Give it a try and let us know here if you hit any problems.
- An excellent discussion on versioning tabular (SQL) data. Do you know of any tools that deal better with SQL-specific versioning?
It’s a wide topic, and the right solution depends on the specific scenario and on what exactly needs to be versioned. DVC does not provide any special functionality on top of databases to version their content.
Depending on your use case, our recommendation would be to run the SQL query and take the resulting file (a CSV/TSV file, for example) under DVC control; that file can then be used for analysis. Alternatively, in certain cases the source files that are used to populate the database can be taken under control instead, so you can keep versions of them or track incoming updates.
Read the discussion to learn more.
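As a concrete sketch of that workflow (the database, table, and file names below are made up for illustration), a script could materialize a query result as a CSV file, which is then put under DVC control:

```python
# Illustrative sketch: dump a SQL query result to a CSV file that can
# then be tracked with `dvc add`. Uses sqlite3 so the example is
# self-contained; any database driver would work the same way.
import csv
import sqlite3

def export_query_to_csv(db_path, query, out_path):
    """Run `query` against the database and write the result to CSV."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute(query)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
    conn.close()
    return out_path

# Afterwards the result file goes under DVC control:
#   $ dvc add training_data.csv
```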
- How does DVC do the versioning of binary files? Is there a binary diff, similar to Git? Or is every version stored distinctly in full?
DVC saves every file as is; we don’t use binary diffs right now. There won’t be full directory duplication, though (e.g. if you added just a few files to a directory with 10M files), since we treat every file inside as a separate entity.
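To illustrate why identical content is stored only once, here is a minimal sketch of the content-addressable scheme DVC’s cache is based on (DVC uses MD5 checksums and a similar directory layout, but this code is illustrative, not DVC’s actual implementation):

```python
# Minimal sketch of a content-addressable cache: a file is stored under
# a path derived from its MD5 checksum, so identical content (from any
# number of files or versions) occupies cache space exactly once.
import hashlib
import os
import shutil

def cache_file(path, cache_dir):
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    # DVC-style layout: the first two hex chars form a subdirectory.
    dest = os.path.join(cache_dir, md5[:2], md5[2:])
    if not os.path.exists(dest):  # already cached -> nothing to copy
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy(path, dest)
    return dest
```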
The simplest option is to create a config file (JSON or whatever format you prefer) that your scripts read and your stages depend on.
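For instance (the file and stage names here are hypothetical), a pipeline script could read its parameters from `config.json`, and the stage would list that file as a dependency:

```python
# Hypothetical example of a pipeline script reading a JSON config file
# that its DVC stage declares as a dependency.
import json

def load_config(path="config.json"):
    with open(path) as f:
        return json.load(f)

# The stage is then created with the config as a dependency, e.g.:
#   $ dvc run -d config.json -d train.py -o model.pkl python train.py
```

With that setup, changing a value in `config.json` invalidates the stage, and `dvc repro` will re-run it.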
- What is the best way to get cached output files from different branches simultaneously? For example, cached tensorboard files from different branches to compare experiments.
There is a way to do that pretty easily through our (still not officially released) API. Here is an example script showing how it could be done.
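The gist of that script, simplified here, is that a DVC-file stores its output’s checksum, from which the cache path can be derived; `git show <branch>:<file>.dvc` yields the metafile for any branch, so outputs from several branches can be located in the cache at once. (Real DVC-files are YAML; the regex parsing below is just for illustration.)

```python
# Illustrative sketch: derive the cache path of an output from the
# checksum recorded in its DVC metafile.
import os
import re

def cache_path_from_dvcfile(dvcfile_text, cache_dir=".dvc/cache"):
    # A DVC metafile records the output's md5 checksum; the cache path
    # is the checksum split as <first 2 chars>/<remaining 30 chars>.
    m = re.search(r"md5:\s*([0-9a-f]{32})", dvcfile_text)
    checksum = m.group(1)
    return os.path.join(cache_dir, checksum[:2], checksum[2:])
```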
- Docker and DVC. To be able to push/pull data we need to run a git clone to get the DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).
You can do `git clone --depth 1`, which will not download any history except the latest commit.
If you are pushing the same file, no copies are pushed or saved in the cache. DVC uses checksums to identify files, so if you add the same file once again, it will detect that the cache for it already exists locally and won’t copy it again. The same goes for `dvc push`: if it sees that a cache file with that checksum already exists on your remote, it won’t upload it again.
Something like this should work:
```
$ which dvc
/usr/local/bin/dvc
$ ls -la /usr/local/bin/dvc
/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc
$ sudo rm -f /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc
```
Just add the public URL of the bucket as an HTTP endpoint. See here for an example: https://remote.dvc.org/get-started is made to redirect to an S3 bucket anyone can read from.
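For reference, the resulting remote section in `.dvc/config` would look something like this (the remote name is arbitrary):

```ini
['remote "public-data"']
url = https://remote.dvc.org/get-started
```

The same section can be created with `dvc remote add public-data https://remote.dvc.org/get-started`.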
- I’m getting the same error over and over about locking:
ERROR: failed to lock before running a command: cannot perform the cmd since DVC is busy and locked. Please retry the command later.
Most likely it happens due to an attempt to run DVC on an NFS mount that has some configuration problems. There is a well-known problem with DVC on NFS: sometimes it hangs trying to lock a file. The usual workaround is to allocate the DVC cache on NFS, but keep the project itself (the Git clone, DVC metafiles, etc.) on the local file system. Read this answer to see how it can be set up.
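Concretely, the setup could look like this in `.dvc/config` (the NFS mount point below is illustrative):

```ini
[cache]
dir = /mnt/nfs-storage/dvc-cache
```

This can also be set with `dvc cache dir /mnt/nfs-storage/dvc-cache`, while the repository itself (Git history and DVC-files) stays on the local disk.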