Large dataset makes RAM usage explode during caching

Hi,

For a couple of days now, my DVC pipeline has been crashing, and I suspect this happens during the DVC caching step of one of my CI pipeline jobs. It all happens while dvc repro (dvc_repo.reproduce) is executed in a Python script that runs our DVC stages.

I’m only able to reproduce this problem on our main server; on my laptop and some other servers it works fine.

Based on Python logging, I’m sure this happens during the “Computing file/dir hashes (only done once)” step. The memory usage increases steadily until my pipeline crashes, and I can see that more than 100 GB of RAM has been consumed. The issue only appears when I use a large number of files as data.

Here are the versions I’m using:

dvc=2.9.3

The dataset being cached is roughly 5.7 GB in size and contains 83,539 files.

Has anyone experienced something similar, or can someone help me with profiling the DVC internals?

Addendum: the pipeline works with DVC version 2.3.0. DVC version 2.9.5 seems to cause the same error as 2.9.3.

I have now been able to reproduce the error on the server and my laptop.

Here is some background information. We have a remote location for the project’s data where the DVC cache is stored. The data version to be used for our project, along with the path to the cache, is stored in a Git repository (a data registry project). So the cache and the data version we use in the project are fetched from the registry project. Our DVC cache currently contains only states created with dvc=2.3.0.

In our pipeline, this data is checked out and updated for the project in the first step. To save time when running on the laptop and server, I had commented this step out and replaced it with a one-time manual dvc update. After I added the automated dvc update back to the pipeline for an experiment, the known error also occurred on the laptop and server. I am now creating a dataset with dvc=2.9.5 in the DVC remote and will continue using it in the project itself. I hope this solves the caching problem.

The attached graphic shows how our data is organized:

[image: diagram of the data setup described below]

We have a data storage location that contains the datasets raw, dev, and subsample. The state of the raw data is tracked by data.dvc in the data registry and is updated continuously. The DVC pipeline in the data registry produces two more folders as output: a subsample and a dev set of the original full dataset. The state of these subsets is stored in the dvc.lock file.

We then use the datasets created in the data registry as the basis for our actual project, where we fetch the data we need for execution with a dvc update.
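For context, here is a minimal sketch of how this registry setup is wired up via the dvc.repo API; the repository URL and paths below are placeholders, not our real ones:

import dvc.repo

repo = dvc.repo.Repo(".")

# One-time: import a dataset from the data registry. This writes a .dvc file
# that pins the registry revision (URL and paths are hypothetical):
repo.imp(
    url="https://git.example.com/our-org/data-registry.git",
    path="data/subsample",
    out="data/subsample",
)

# In the pipeline: refresh the import to the revision currently pinned in
# the registry:
repo.update("data/subsample.dvc", jobs=10)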

After some testing I figured out that the first DVC version where the RAM explosion appeared was dvc=2.6.0.

I also tested dvc=2.10.0 and ran into other issues.

Could it be that the issue was introduced with sshfs or fsspec?

Thanks for the detailed report @Jens, and sorry for the late reply. I will try to reproduce and profile this and get back to you.

@Jens Could you share the traceback of the command causing the memory error?

This will be difficult because no error is thrown by DVC.

In fact, we do something like the following in a script.py:

import dvc.repo

dvc_repo = dvc.repo.Repo("repo_path")
dvc_repo.update("target_dvc_file.dvc", jobs=10)
dvc_repo.reproduce(all_pipelines=True)

This code is executed in a Docker container, and I tracked the memory usage with htop. With dvc > 2.5.4 the memory slowly increases and doesn’t stop; once it is exhausted I get /bin/bash: line 157: 290 Killed in our logs. This memory increase does not appear if I use dvc <= 2.5.4. It also does not seem to happen if I skip the .update part and execute it manually before running python script.py.
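As a side note, instead of watching htop from outside the container, the script can log its own peak memory between the two calls; a minimal sketch using only the standard library (assuming Linux, where ru_maxrss is reported in KiB):

import resource

def log_peak_rss(label):
    # Peak resident set size of this process; KiB on Linux, bytes on macOS
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: peak RSS ~{peak_kib / 1024:.1f} MiB")

log_peak_rss("after update")  # e.g. called between update() and reproduce()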

The incident itself occurs during the “Computing file/dir hashes (only done once)” phase of the dvc repro execution. It seems there is some kind of memory leak during that computation.

So you are using the dvc.repo API directly?

If you run the commands separately via command line, do you get the OOM error?

Apart from that, could you try to run the following (requires pip install filprofiler):

fil-profile run script.py

It should still generate output even if the OOM error occurs.

@daavoo Yes, I use the dvc.repo API directly in my case. If I execute dvc update file_1.dvc (dvc=2.9.5) on the command line, it seems to cause the same RAM issues, but during the update computation. The latest dvc hash was done with dvc=2.9.5, all others with dvc=2.3.0. Is it possible that this inconsistency causes these issues?

I executed script.py with fil-profile run and got the outputs. How can I send them to you?

Additionally, htop showed 62.8 GB of allocated memory after 538 hashes were done, and 86.5 GB after 872 hashes.

I should mention that our hash is very large, since we have run many experiments with it.

Received the logs.

@Jens, can you try re-running with a single thread, i.e. setting dvc config core.checksum_jobs 1 and also using update(..., jobs=1)?
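For reference, a sketch of the single-threaded variant of your script, after running dvc config core.checksum_jobs 1 in the repo (the paths are the placeholders from your snippet):

import dvc.repo

dvc_repo = dvc.repo.Repo("repo_path")
dvc_repo.update("target_dvc_file.dvc", jobs=1)  # jobs=1 forces a single worker
dvc_repo.reproduce(all_pipelines=True)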

I should mention that our hash is very large, since we have run many experiments with it.

So what kind of data is contained in the folder you are tracking?

@daavoo Sorry for the late reply. This didn’t solve the issue. I will send you the logs.

So what kind of data is contained in the folder you are tracking?

For this particular case, we track only the samples as raw data, plus two subsamplings of the raw data in separate folders.

Hi @Jens. Sorry for the late response. I believe your problem is related to Large memory consumption and slow hashing on `dvc status` after `dvc import` of parent of dvc-tracked folder · Issue #6640 · iterative/dvc · GitHub. Please check the status there for updates.