Multiple users in one repo

Hi,

We have a scenario that someone might shed some light on.

We are successfully using the Shared Development Server and Data Registries setups, with a shared cache and user groups, in our deep learning development. It all works perfectly (shared cache, remotes, etc.) except for one issue that is very specific to our use case.

Our current approach for data registries is that for any particular dataset we have a master copy of raw original data in that data registry repo on the server. Any data injections are done only from within that repo. So we always have a copy of raw datasets (not hashed) along with caches and a copy on the remote, which are hashed.

The issue with this scenario is that we need several users to be able to add, commit and push from within a particular repo. Even when they are in the same group, we have to manually set permissions to 2775 (directories) and 0664 (files) for everything in the .dvc folder (the cache is not there, as per the shared cache scenario) so that dvc works. Git commands work without any permission issues. From time to time permissions break on the dvc remote too, and we have to reset them manually to 2775 and 0444.

Do we need to set up dvc some other way? Or is this simply not supported yet, but could be supported with a contribution?

Thanks!
Simon

Hi @simon

Just to make sure I understand the issue, you have set

dvc config cache.shared group

as outlined in the docs, correct?

And the .dvc/ permissions issue you have is because all of your users are using git add/commit and dvc push from within a single shared working tree on your server?

Hi @pmrowla

Thanks for the prompt reply. Yes, correct, dvc config cache.shared group is set. And we need to work from within a single shared working tree on the server.

After initial setup and first dvc add, git add/commit/push and dvc push by one user, when another user tries to run any dvc command like dvc status, it throws sqlite3.OperationalError: attempt to write a readonly database. As a quick debug we looked for db files in repo which are .dvc/tmp/md5s/cache.db and .dvc/tmp/links/cache.db and they have 0644 permissions by default. Setting everything in .dvc folder to 2775 and 0664 does the trick and other users can run dvc commands from within that repo.

But then dvc push occasionally fails because of folder permissions on the remote: some folders end up with 2755 when, as we understand it, they need 2775. At least setting them back to 2775 solves that.
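The manual reset described above could be scripted roughly as follows. This is only a sketch: fix_remote_dirs is a hypothetical helper name, the remote path is a placeholder, and it assumes GNU find and that the caller owns the directories (or is root).

```shell
# Sketch: find directories on the remote that lack group write and reset
# only those to 2775 (rwxrwsr-x). Helper name and path are hypothetical.
fix_remote_dirs() {
    remote_dir="$1"
    # -perm -020 matches directories that HAVE group write; "!" inverts it.
    # Each offending directory is printed, then chmod'ed.
    find "$remote_dir" -type d ! -perm -020 -print -exec chmod 2775 {} +
}

# Usage (placeholder path from this thread):
# fix_remote_dirs /path-to-remote/dvc-test-umask
```

Only the owner of a directory (or root) can chmod it, so this would have to be run by whoever created the offending folders.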

The issue is that regular files created by DVC (such as the state databases that give you the errors) will always have the default permissions for new files set by your OS/shell.

The cache.shared group option forces DVC to explicitly chmod DVC cache directory files to 0664, but it’s not really practical for DVC to do that operation to every possible file that DVC touches (including tmp files). And the normal use case in DVC is for users to have their own working trees (with a single explicitly shared cache directory).

I think what you are really looking for here is setting the appropriate umask in your users’ shell. By default your umask is probably 022 which makes the newly created file permissions 0644. If you run

umask 002

it will make the default permissions for newly created files 0664. Note that this will only be applied for the duration of your current shell session.
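As a quick illustration of the effect, here is a small sketch that creates one file under each umask and prints the resulting mode. It assumes GNU coreutils (stat -c), as on an Ubuntu server; demo_dir is just a throwaway directory.

```shell
# Sketch: compare the mode of a newly created file under two umask values.
demo_dir=$(mktemp -d)   # throwaway directory

# Command substitution runs in a subshell, so each umask change stays local.
mode_022=$(umask 022; touch "$demo_dir/a"; stat -c '%a' "$demo_dir/a")
mode_002=$(umask 002; touch "$demo_dir/b"; stat -c '%a' "$demo_dir/b")

echo "umask 022 -> $mode_022"   # 644 (rw-r--r--)
echo "umask 002 -> $mode_002"   # 664 (rw-rw-r--)
rm -rf "$demo_dir"
```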

You probably aren’t looking to override the default umask system wide on your server, so one possible solution I can think of would be to install DVC in a virtualenv on your server, and add the line for setting your desired umask value (002) in the activate script for that virtualenv.

So when any of your users activate that venv to use DVC, their shell will have the appropriate settings for running DVC in your shared working tree.
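One way to add that line is sketched below. add_umask_to_activate is a hypothetical helper and /opt/dvc-venv is a made-up location; the function only appends the umask line if it is not already present, so running it twice is harmless.

```shell
# Sketch: append "umask 002" to a virtualenv's activate script, once.
# The helper name and the example path are assumptions, not DVC features.
add_umask_to_activate() {
    activate_script="$1"
    if [ ! -f "$activate_script" ]; then
        echo "not found: $activate_script" >&2
        return 1
    fi
    # Only append if the exact line is not there yet.
    if ! grep -qx 'umask 002' "$activate_script"; then
        printf '\n# Group-writable files for the shared DVC working tree\numask 002\n' >> "$activate_script"
    fi
}

# Usage (hypothetical path):
# add_umask_to_activate /opt/dvc-venv/bin/activate
```

Note that the standard deactivate function does not know about this line, so the umask stays at 002 for the rest of that shell session even after deactivating.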

You can also accomplish the same thing in any shell script that is sourced before running DVC, for example in a .bashrc (but I wouldn’t recommend using bashrc for this since your users probably only want this specifically for the DVC directory). Depending on your server OS there may be ways to set this type of permission per directory as well via commands like setfacl, but that’s platform specific.

Thanks, @pmrowla. This is very helpful. I did some tests with umask 0002 and 0022. Please find below.

uname -a
Linux SERVER 5.4.0-72-generic #80~18.04.1-Ubuntu SMP Mon Apr 12 23:26:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

dvc --version
2.1.0

Test umask 0002

umask
0002

mkdir test_umask
cd test_umask
git init
dvc init
ll .dvc
-rw-rw-r-- 1 simon dvcgroup 0 May 17 10:57 config
-rw-rw-r-- 1 simon dvcgroup 26 May 17 10:57 .gitignore
ll .dvc/tmp/md5s/
-rw-r--r-- 1 user dvcgroup 32768 May 17 10:57 cache.db
ll .dvc/tmp/links/
-rw-r--r-- 1 user dvcgroup 32768 May 17 10:57 cache.db

So even with umask 0002, those two db files still have 0644. Other files in .dvc have 0664.

dvc remote add -d myremote ssh://path-to-remote/dvc-test-umask
dvc remote modify myremote ask_password true
cat > test.txt
TEST
dvc add test.txt
dvc push
ll /path-to-remote/dvc-test-umask/
drwxrwsr-x 2 user dvcgroup 4096 May 17 11:03 2d/

Folder permissions in remote are 2775 as expected.

Test umask 0022

rm -rf test_umask
rm -rf /path-to-remote/dvc-test-umask
umask 0022
mkdir test_umask
cd test_umask
git init
dvc init
ll .dvc
-rw-r--r-- 1 user dvcgroup 0 May 17 11:10 config
-rw-r--r-- 1 user dvcgroup 26 May 17 11:10 .gitignore
ll .dvc/tmp/md5s/
-rw-r--r-- 1 user dvcgroup 32768 May 17 11:10 cache.db
ll .dvc/tmp/links/
-rw-r--r-- 1 user dvcgroup 32768 May 17 11:10 cache.db

All files in .dvc have 0644 as expected.

dvc remote add -d myremote ssh://path-to-remote/dvc-test-umask
dvc remote modify myremote ask_password true
cat > test.txt
TEST
dvc add test.txt
dvc push
ll /path-to-remote/dvc-test-umask/
drwxrwsr-x 2 user dvcgroup 4096 May 17 11:15 f1/

Folder permissions in the remote are 2775, although with umask 0022 we expected 2755. Not sure if this is intended.

@simon thanks for catching that.

Based on some googling, it looks like this is actually an old sqlite3 bug which has never been resolved. sqlite3 databases are apparently hard-coded to always be created with the umask applied to 0644 rather than 0666 (so setting umask to 0002 will still result in a sqlite3 database file with 0644 permissions).

sqlite3 bug report: [sqlite] permissions for created database files are too restrictive and do not obey umask setting
similar django issue: #19292 (syncdb ignores umask when creating a sqlite database) – Django

The workaround for your case would be to explicitly chmod 0664 .dvc/tmp/**/*.db after calling dvc init (in addition to setting your umask to 0002). This should ensure that the permissions are correct for your entire .dvc directory.
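The `**` pattern only recurses in shells with globstar enabled, so a find-based variant may be more portable. A minimal sketch, where fix_dvc_state_dbs is a hypothetical helper name and the repo root defaults to the current directory:

```shell
# Sketch: make DVC's sqlite state databases group-writable.
# Helper name is hypothetical; run it as the user who ran "dvc init".
fix_dvc_state_dbs() {
    repo_root="${1:-.}"
    # Matches e.g. .dvc/tmp/md5s/cache.db and .dvc/tmp/links/cache.db
    find "$repo_root/.dvc/tmp" -type f -name '*.db' -exec chmod 0664 {} +
}

# Usage, inside a fresh repo:
# umask 002
# dvc init
# fix_dvc_state_dbs
```

Only the owner of each .db file (or root) can change its mode, which is why this fits best right after the initial dvc init.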

We could potentially consider checking umask and overriding the permissions for those .db files in DVC, but it is not going to be high priority for us given that it’s a native sqlite3 issue that also affects Python in general.

Thanks @pmrowla, much appreciated!