Support for HDFS

deepesh · March 29, 2018, 4:57am

I have a use case where I use pyspark for my model training and the file is in HDFS.
How would dvc help in managing my ML pipeline?

If this is not supported, can you please suggest a workaround?

Regards,
Deepesh.

dmitry · March 30, 2018, 6:20am

Do you run all of your steps on a cluster? Or only the first few data preprocessing steps on a cluster, and then continue on a local machine?

deepesh · April 6, 2018, 11:25am

We are building a generic framework that should support Python, R, pyspark and rspark use cases.

A.) For starters, if we talk about a pyspark use case, we can do model training on a cluster and prediction on a local system.

B.) Another use case is feature transformation… I am adding the data transformation step to the dvc pipeline. Now my input data is, say, 2 TB and on HDFS. How would dvc handle this? Do you see any challenges? How can I use the -d option to refer to a file on an HDFS cluster?

I hope the problem description is clear.

Regards,
Deepesh

dmitry · April 6, 2018, 6:26pm

Unfortunately, DVC does not support HDFS dependencies yet.

However, you can create two types of workarounds:

1. Implement a step which generates a new file if there are any changes in HDFS.
2. Separate the cluster part and the local part:
   - trigger the cluster part manually
   - update files in DVC: dvc remove input.tsv.dvc; cp ~/input.tsv .; dvc add input.tsv
   - reproduce the local part: dvc repro
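
The first workaround could be sketched roughly as below. This is only an illustration, not something DVC provides: the helper names (list_hdfs, fingerprint, update_marker), the HDFS path /data/input and the marker file name are all hypothetical, and it assumes the standard hdfs command-line client is on the PATH. The idea is to hash a recursive HDFS listing and rewrite a small local marker file only when that hash changes.

```python
import hashlib
import subprocess
from pathlib import Path


def list_hdfs(path: str) -> str:
    """Return a recursive listing of an HDFS path (assumes the `hdfs` CLI is installed)."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def fingerprint(listing: str) -> str:
    """Hash the listing; each line carries the file size and modification time."""
    lines = sorted(line.strip() for line in listing.splitlines() if line.strip())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def update_marker(listing: str, marker: Path) -> bool:
    """Rewrite the marker file only if the fingerprint changed; return True on change."""
    new = fingerprint(listing)
    old = marker.read_text().strip() if marker.exists() else None
    if new == old:
        return False
    marker.write_text(new + "\n")
    return True


# Hypothetical usage (path and file name are placeholders):
#   update_marker(list_hdfs("/data/input"), Path("hdfs_input.marker"))
```

The marker file stays small regardless of how large the HDFS data is, so it may be a reasonable candidate to list as a local dependency with the -d option in place of the HDFS file itself. Note that a listing only captures sizes and timestamps, not file contents, so this detects changes only approximately.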

If you use a pipeline tool like Oozie, Luigi or Airflow, I would recommend separating data engineering pipelines from modeling pipelines (DVC). It might look like a complicated solution; however, it is the best way to abstract your modeling team and its activities away from data engineering activities. In this case, DVC establishes a well-defined communication protocol between engineers and data scientists (even if they are the same person).
