For a couple of days now, my DVC pipeline has been crashing, and I suspect it happens during the DVC caching step of one of my CI pipeline jobs. It all occurs while dvc repro (dvc_repo.reproduce) is executed in a Python script that runs our DVC stages.
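The script drives DVC through its Python API, roughly like this (a minimal sketch; the project path and how the results are used are simplified, not our exact code):

```python
from dvc.repo import Repo

# Minimal sketch of how the stages are run from the script.
dvc_repo = Repo(".")           # open the DVC project in the current directory
stages = dvc_repo.reproduce()  # runs `dvc repro`; the hashing happens in here
print(f"Reproduced {len(stages)} stage(s)")
```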
I can only reproduce this problem on our main server; it works fine on my laptop and some other servers.
Based on the Python logging output, I am sure this happens during “Computing file/dir hashes (only done once)”. The memory usage increases steadily until my pipeline crashes, and I can see that more than 100 GB of RAM have been consumed. The issue only appears when I use a dataset with many files.
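For reference, DVC’s own logger can be made verbose from the script to see what it is doing around the hashing phase (just a sketch of the logging setup, not our exact configuration):

```python
import logging

# Surface DVC's internal logger ("dvc") in the job logs so its status
# output around the hashing phase is visible.
logging.basicConfig(level=logging.INFO)
logging.getLogger("dvc").setLevel(logging.DEBUG)
```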
Here are the versions I'm using:
dvc=2.9.3
The data set being cached is roughly 5.7 GB in size and contains 83,539 files.
Has anyone experienced something similar, or can anyone help me with profiling the DVC internals?
I have now been able to reproduce the error on the server and my laptop.
Here is some background information. We have a remote location for the project's data where the DVC cache is stored. The data version to be used for our project, along with the path to the cache, is stored in a Git repository (the data registry project). So both the cache and the data version we use in the project are fetched from the registry project. In our DVC cache, we currently only have states that were created with dvc=2.3.0.
In our pipeline, this data is checked out and updated in the first step for our project. To save time when running on the laptop and the server, I had commented out this step and replaced it with a one-time manual dvc update. After I added the automated dvc update back to the pipeline for an experiment, the known error also occurred on the laptop and the server. I am now creating a dataset with dvc=2.9.5 in the DVC remote and will continue to use it in the project itself, hoping that this solves the caching problem.
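For context, the automated update step is essentially the Python API equivalent of dvc update (a simplified sketch; error handling and the exact targets are omitted):

```python
from dvc.repo import Repo

# Sketch of the automated update step in the pipeline: refresh the data
# imported from the data registry (file_1.dvc is the import file in our project).
repo = Repo(".")
repo.update(targets=["file_1.dvc"])  # roughly what `dvc update file_1.dvc` does
```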
The attached graphic shows how our data is set up:
We have a data storage location that contains the datasets raw, dev, and subsample. The state of the raw data is tracked by data.dvc in the data registry and is updated continuously. The DVC pipeline in the data registry produces two more folders as outputs: a subsample and a dev set of the original full dataset. The state of these subsets is stored in the dvc.lock file.
We then use the datasets created in the data registry as the basis for our actual project, where we fetch the data needed for execution with a dvc update.
After some testing, I figured out that the first DVC version in which the RAM explosion issues appeared was dvc=2.6.0.
I also tested dvc=2.10.0 and got other issues.
Could it be that the issue was introduced with sshfs or fsspec?
This code is executed in a Docker container, and I tracked the memory usage with htop. With dvc > 2.5.4, the memory slowly increases and does not stop. Once the available memory is exhausted, I get /bin/bash: line 157: 290 Killed in our logs. This memory increase does not appear if I use dvc <= 2.5.4. It also does not seem to happen if I skip the .update part and execute it manually before running python script.py.
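To cross-check the htop numbers from inside the container, the resident memory of the Python process can also be logged directly from the script, for example with psutil (just a sketch; psutil is a third-party package, not part of DVC):

```python
import os
import psutil  # third-party, `pip install psutil`

def log_rss(tag: str) -> None:
    # Print the resident set size (RSS) of this process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[mem] {tag}: {rss_gb:.1f} GB", flush=True)

log_rss("before reproduce")
# dvc_repo.reproduce() runs here
log_rss("after reproduce")
```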
The incident itself occurs during “Computing file/dir hashes (only done once)” in the dvc repro execution. It seems that there is some kind of memory leak during the hash computation.
@daavoo Yes, I use the dvc.repo API directly in my case. If I execute dvc update file_1.dvc (dvc=2.9.5) on the command line, it seems to cause the same RAM issues, but during the update computation. The latest DVC hash was created with dvc=2.9.5, all others with dvc=2.3.0. Is it possible that this inconsistency causes these issues?
I executed script.py with fil-profile run and got the outputs. How can I send them to you?
Additionally, htop shows 62.8 GB of allocated memory after 538 hashes were done, and 86.5 GB after 872 hashes.
I should also mention that our hash is very large because we ran many experiments with it.