Support for HDFS

I have a use case where I use pyspark for my model training, and the file is in HDFS.
How would DVC help in managing my ML pipeline?

If this is not supported, can you please suggest a workaround?


Do you run all of your steps on a cluster? Or only the first few data preprocessing steps on a cluster, and then continue on a local machine?

We are building a generic framework that should support Python, R, pyspark, and rspark use cases.

A.) For starters, if we talk about a pyspark use case: we can do model training on a cluster and prediction on a local system.

B.) Another use case is feature transformation… I am adding the data transformation step to the DVC pipeline. Now my input data is, say, 2 TB and on HDFS. How would DVC handle this? Do you see any challenges? How can I use the -d option to refer to a file on an HDFS cluster?

Hope the problem description is clear.


Unfortunately, DVC does not support HDFS dependencies yet.

However, you can create two types of workarounds:

  1. Implement a step that generates a new file whenever there are any changes in HDFS.
  2. Separate the cluster part and the local part:
  • trigger the cluster part manually;
  • update the file in DVC: `dvc remove input.tsv.dvc; cp ~/input.tsv .; dvc add input.tsv`;
  • reproduce the local part: `dvc repro`.
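For workaround 1, the change-detection step could be a small script that writes a fingerprint of the data file to a local text file, which you then list as a `-d` dependency of your DVC stage — `dvc repro` re-runs the stage only when the fingerprint changes. A minimal sketch in Python (the paths are made up, and hashing a locally readable file stands in for HDFS access; against real HDFS you would stream the file with a client library or shell out to `hdfs dfs -checksum` instead):

```python
# Hypothetical sketch: write an MD5 fingerprint of a data file so that
# DVC can use the small fingerprint file as a dependency instead of the
# 2 TB dataset itself. Paths are illustrative, not from the thread.
import hashlib

def fingerprint(data_path, out_path):
    """Hash data_path in chunks and write the hex digest to out_path."""
    h = hashlib.md5()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    digest = h.hexdigest()
    with open(out_path, "w") as f:
        f.write(digest + "\n")
    return digest

if __name__ == "__main__":
    # e.g. run this before `dvc repro`, then depend on input.tsv.md5:
    #   dvc run -d input.tsv.md5 -o model.pkl python train.py
    fingerprint("input.tsv", "input.tsv.md5")
```

The point of the indirection is that the fingerprint file is tiny, so DVC can hash it cheaply on every `dvc repro`, while the actual data never leaves the cluster.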

If you use a pipeline tool like Oozie, Luigi, or Airflow, I would recommend separating the data engineering pipelines from the modeling pipelines (DVC). It might look like a complicated solution; however, it is the best way to abstract your modeling team and activities away from data engineering activities. In this case, DVC establishes a well-defined communication protocol between engineers and data scientists (even if this is the same guy :slight_smile: ).