Trace back files in a non-S3 bucket in a structured way

Hi,
I am currently exploring DVC for my organization's project, in a way that we could productionize in the future.

This project has multiple folders, each with its own set of input files. I have tried setting up a DVC remote to upload them to a bucket and keep versions by adding the folders to DVC.
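Roughly, this is what I have done so far (the remote name, bucket URL, and folder name below are just placeholders, not my real ones):

```
# point DVC at the bucket (ours is not S3; gs:// is only an example here)
dvc remote add -d storage gs://my-bucket/dvc-store

# track one input folder and upload it to the remote
dvc add data/folder1
git add data/folder1.dvc data/.gitignore
git commit -m "Track folder1 with DVC"
dvc push
```

But I have a few problems here: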

  1. I have a data folder in my repo. Is there a way to take data from SharePoint and store it directly in the bucket, without adding those files to the current repo, while still tracking them? Would external-data be the way to go for this?
  2. Currently, when I push files to the bucket, DVC creates its own structure there, e.g. files/md5/e4/12344. I need to take files from this bucket and use them in my pipeline, but how do I tell which md5-named object to pass, since it has no reference to the original input name? For example, file A is required by code1 and file B by code2, but the md5 names make that mapping difficult, and since multiple versions are kept I also don't know which hash to pass.
  3. If the answer is to refer to the lock files and define stages with dependencies in dvc.yaml, then I still have to write those references manually in the stages (see the sketch after this list); DVC is not doing that automatically.
  4. Also, when I need to refer to an old version of a file from the bucket in the future, how do I know which one it is? There is no timestamp, and the hash names make it difficult to tell which file is the latest, second latest, and so on. I know there is dvc checkout, but for that too I have to keep passing the file name and then run dvc checkout.
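To make question 3 concrete, this is the kind of dvc.yaml I believe I would have to write by hand, spelling out each original file name as a dependency myself (the stage, script, and file names here are made up just to illustrate):

```
stages:
  stage1:
    cmd: python code1.py data/fileA.csv
    deps:
      - code1.py
      - data/fileA.csv   # I have to refer to the original name here manually
    outs:
      - outputs/result1.csv
  stage2:
    cmd: python code2.py data/fileB.csv
    deps:
      - code2.py
      - data/fileB.csv
    outs:
      - outputs/result2.csv
```

Is there a way for DVC to build or infer this mapping automatically, or is writing it by hand the expected workflow?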

Before posting here I went through the documentation, but I could not find an implementation of a use case quite like mine.

If you can help with this, I can complete my POC and decide whether to implement DVC or look for an alternative.