I have a use case where I use PySpark for my model training and the file is in HDFS.
How would DVC help in managing my ML pipeline?
If this is not supported, can you please suggest a workaround?
Regards,
Deepesh.
Do you run all of your steps on a cluster? Or only the first few data preprocessing steps on a cluster, and then continue on a local machine?
We are building a generic framework that should support Python, R, PySpark, and rspark use cases.
A.) For starters, if we talk about a PySpark use case, we can do model training on a cluster and prediction on a local system.
B.) Another use case is feature transformation… I am adding the data transformation step to the DVC pipeline. Now my input data is, say, 2 TB and on HDFS. How would DVC handle this? Do you see any challenges? How can I use the -d option to refer to a file on an HDFS cluster?
Hope I am clear with the problem description.
Regards,
Deepesh
Unfortunately, DVC does not support HDFS dependencies yet.
However, you can create two types of workarounds.

The first is to manually refresh a local copy of the input file and re-add it to DVC before reproducing the pipeline:

dvc remove input.tsv.dvc; cp ~/input.tsv .; dvc add input.tsv
dvc repro
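
For example, here is a minimal sketch of that first workaround that pulls the fresh copy straight out of HDFS instead of from the home directory (the HDFS path /data/input.tsv is only an assumption for illustration):

# Stop tracking the previous version of the file
dvc remove input.tsv.dvc
rm -f input.tsv            # in case dvc remove left the old copy behind

# Pull a fresh snapshot of the data out of HDFS into the local workspace
hdfs dfs -get /data/input.tsv input.tsv

# Re-register the file with DVC and re-run every stage that depends on it
dvc add input.tsv
dvc repro

Your pipeline stages then declare the local copy with -d input.tsv, so dvc repro only re-runs them when the snapshot actually changes.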
The second: if you use a pipeline tool like Oozie, Luigi, or Airflow, I would recommend separating the data engineering pipelines from the modeling pipelines (DVC). It might look like a complicated solution; however, it is the best way to abstract your modeling team and activities away from data engineering activities. In this case, DVC establishes a well-defined communication protocol between engineers and data scientists (even if this is the same guy).
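
As a rough illustration of that split (the file names, scripts, and dvc run-style stage definitions below are assumptions; recent DVC versions express the same stages with dvc stage add): the scheduler's last task materializes an agreed-upon handoff file on the machine where DVC runs, and the modeling pipeline is declared only against that local file.

# Data engineering side (an Oozie/Luigi/Airflow task): export the prepared
# features from the cluster into the agreed handoff file
hdfs dfs -get /warehouse/features/features.tsv features.tsv

# Modeling side (DVC): version the handoff file and build stages on top of it
dvc add features.tsv
dvc run -d features.tsv -d train.py -o model.pkl python train.py
dvc run -d model.pkl -d predict.py -o predictions.tsv python predict.py

With that boundary, the data engineers only have to keep the handoff file fresh, and the data scientists only ever interact with DVC.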