Hi, I have 400 GB of images and videos in a self-hosted MinIO S3 bucket that I use as a DVC remote. I also have the Git repo with all the data locally. When I add a few images to the data locally and run dvc add, it takes a long time, and running dvc status also takes a long time. Before this, when I used Git LFS to track my data, both git add and git status ran almost instantly. Is this normal, or am I doing something wrong? This is the process I go through when I modify some data in my Git repo, which is integrated with DVC:
Hi, it is hard to say without knowing how many files you have. But yes, DVC is slow for 400 GB of files.
When I add a few images to the data locally and run dvc add, it takes a long time
When updating an existing dataset, instead of running dvc add data on the whole directory, you can selectively ask DVC to update only the part of the dataset that changed; it will be faster that way.
For example:
dvc add data/images/00/
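Similarly, if dvc status is the slow part, pointing it at a single target can help (this assumes your dataset is tracked by a data.dvc file; adjust the name to whatever your repo actually uses):
dvc status data.dvc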
If you can, please provide profiling data. You can generate it as follows:
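For example, here is a rough sketch using Python's built-in cProfile module (the exact invocation may differ in your environment; it assumes dvc is a Python script on your PATH):
python -m cProfile -o dvc_add.prof $(which dvc) add data
Then attach the resulting dvc_add.prof file here so we can see where the time is being spent.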