I have a pipeline whose output is a model tracked with DVC; let's say it is tagged v0.1.0. I would like to use the weights of this model as the initial weights when training a new model. What would be the best way to achieve this?
Hi @davesteps!
In this case you could leverage dvc import to achieve the desired result.
1. Create a model training stage, e.g. dvc run -n train_model -d train.py -d data -o model python train.py
2. Commit and tag the project.
3. Use dvc import . model --rev {tag} -o initial_weights
4. Add the retraining stage: dvc run -n retrain_model -d data -d initial_weights -d train.py -o retrained_model python train.py
   - I assume here that your train.py code is capable of recognizing whether initial_weights exists and using it if so. In that case the output should also go under retrained_model instead of model.
5. Commit the repo state, and tag if necessary.
6. Subsequent repetitions are easier: if you want to reuse retrained_model, just run dvc import . retrained_model --rev {new_tag} -o initial_weights. Since you use only one initial_weights, you can overwrite it.
7. Repeat committing, tagging and step 6 as long as necessary.
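The assumption about train.py in the steps above (detect initial_weights, and write to retrained_model instead of model when it is present) could look roughly like the sketch below. The directory names come from the commands in this thread; the function names choose_paths and train, the TRAINING_INFO.txt marker file, and the placeholder training step are hypothetical, not anything DVC provides.

```python
from pathlib import Path
from typing import Optional, Tuple

# Directory names taken from the dvc commands in this thread.
INITIAL_WEIGHTS = Path("initial_weights")
FRESH_OUTPUT = Path("model")
RETRAINED_OUTPUT = Path("retrained_model")


def choose_paths(root: Path = Path(".")) -> Tuple[Optional[Path], Path]:
    """Return (weights_to_load, output_dir) for this training run.

    If the imported initial_weights path exists, warm-start from it and
    write to retrained_model; otherwise train from scratch into model.
    """
    weights = root / INITIAL_WEIGHTS
    if weights.exists():
        return weights, root / RETRAINED_OUTPUT
    return None, root / FRESH_OUTPUT


def train(root: Path = Path(".")) -> Path:
    """Hypothetical training entry point; the real training loop is omitted."""
    weights, out_dir = choose_paths(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    start = f"warm start from {weights}" if weights else "fresh initialisation"
    # Placeholder: a real script would build the model, optionally load
    # `weights`, train, and save the resulting weights into `out_dir`.
    (out_dir / "TRAINING_INFO.txt").write_text(start + "\n")
    return out_dir
```

Calling train() from the project root would then write to model on the first run and to retrained_model once initial_weights has been imported.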
Some notes:
The use case you described could ideally be solved if we allowed stages to have model both as an input and as an output. This is currently not supported, due to potential problems with the reproducibility of the project. I guess that with the recent development of our experiments feature (https://github.com/iterative/dvc/pull/4591) we could reconsider implementing it. I can't find any open issue for this. @davesteps Could I ask you to create an issue on GitHub for circular dependencies support?
Thanks for the reply @Paffciu. Yes, I will create a feature request for circular dependency support. I think your solution of using dvc import will work. I simply modified my train stage to have a dependency on initial_weights, plus a param to optionally load them. I just need to remember to run dvc import prior to dvc repro to make sure I am starting with the correct version. It would be really nice to have some way to just specify the version number of the weights to load as a param. I guess this might be possible using the Python API inside train.py?
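As a rough illustration of the Python API idea, the sketch below reads a tracked file at a given Git revision through dvc.api.open, which accepts path, repo, rev and mode arguments. The helper name read_weights and the injectable opener argument are hypothetical (the latter only exists so the function can be exercised without DVC installed), and the sketch makes no claim about whether the data is served from the local cache or from the remote.

```python
from typing import Callable, Optional


def read_weights(
    path: str,
    rev: str,
    repo: Optional[str] = None,
    opener: Optional[Callable] = None,
) -> bytes:
    """Return the raw bytes of a DVC-tracked file as it was at `rev`.

    `rev` is any Git revision (for example a tag such as v0.1.1) and could
    be supplied from a training parameter. `repo=None` means the current
    project.
    """
    if opener is None:
        # Imported lazily so this module can be loaded without DVC installed.
        import dvc.api

        opener = dvc.api.open
    with opener(path, repo=repo, rev=rev, mode="rb") as f:
        return f.read()
```

Inside train.py this might be called as read_weights("model/model-weights.h5", rev="v0.1.1"), with the tag taken from whatever parameter mechanism the stage already uses.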
The issue with using import is that it does not overwrite the existing output, so I would need to delete it manually every time:
ERROR: failed to import 'model' from '.'. - Paths for outs:
'initial_weights'('initial_weights.dvc')
'initial_weights/model'('initial_weights/model.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
I have also experimented with a solution using get instead of import. The trouble with get is that it does not look in the local cache for the model and downloads from the remote every time. Is there some reason get couldn't look in the cache?
@davesteps
dvc import (latest version) overwrites existing paths. What does dvc version report?
What you seem to have encountered is two .dvc files with overlapping paths, which DVC does not allow. I can't be sure without seeing your repo, but it looks to me like you first ran dvc import . initial_weights and then tried dvc import . initial_weights/model -o initial_weights/model. Am I right?
Hi @Paffciu. I have upgraded to version 1.8.2 but still get the issue. I think it might have something to do with the fact that the model object being imported is actually a directory. I tried importing a DVC-tracked YAML file and that works fine, but when I tried another directory, 'test_data', I got the same error message.
Edit: in fact I can do dvc import . model/model-weights.h5 --rev v0.1.1 -o initial_weights.h5, so even though the whole directory is tracked I can import its contents separately.
DVC version: 1.8.2 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.0-48-generic-x86_64-with-Ubuntu-18.04-bionic
Supports: http, https, s3
Cache types: hardlink, symlink
Repo: dvc, git
@davesteps
DVC supports granular imports: it is possible to pick out individual elements of a tracked directory.
So, is your workflow behaving as intended now?
Yes @Paffciu, thanks for your help. Using dvc import . model/model-weights.h5 --rev v0.1.1 -o initial_weights.h5 works as desired.
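Since the import has to be run before dvc repro each time, the two commands could be wrapped in a small helper like the sketch below. It only shells out to the dvc import and dvc repro commands already used in this thread; the function names import_cmd and refresh_and_repro, and the injectable run argument (present so the sequence can be checked without DVC), are hypothetical.

```python
import subprocess
from typing import Callable, List


def import_cmd(path: str, rev: str, out: str, repo: str = ".") -> List[str]:
    """Build the granular `dvc import` command used in this thread."""
    return ["dvc", "import", repo, path, "--rev", rev, "-o", out]


def refresh_and_repro(
    path: str,
    rev: str,
    out: str,
    run: Callable = subprocess.run,
) -> None:
    """Re-import the chosen weights, then reproduce the pipeline."""
    for cmd in (import_cmd(path, rev, out), ["dvc", "repro"]):
        # check=True stops the sequence if a command exits with an error,
        # so the pipeline is not reproduced against stale weights.
        run(cmd, check=True)
```

For the file discussed here the call would be refresh_and_repro("model/model-weights.h5", "v0.1.1", "initial_weights.h5").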
I think I have three issues to create on GitHub: 1) circular dependencies support, 2) the above error when re-importing a directory, 3) a request that get try the local cache first before downloading from the remote.
1) and 3) sound like valid issues.
2) I am not sure it's a valid bug. Do you remember the steps you took that resulted in that error? I reckon there might have been some error in output generation or in specifying import paths. But it's definitely worth investigating, so if you are willing, you might create an issue for that as well.