Nice. It would be good to mention that the data file itself is not checked in to Git but is tracked by DVC instead, which is why it is listed in .gitignore.
Hi, I’m trying to import this with pandas but I keep getting an input/output error. How are you importing this dataset?
Hi @mgntprg, have you imported the dataset to your local machine using dvc get? As in:
$ dvc get https://github.com/iterative/aita_dataset aita_clean.csv
I can’t tell if you’ve done this step yet. If you haven’t, the file won’t be in your local workspace and pandas won’t be able to import it. After running dvc get, you should be able to open Python and run
import pandas as pd
df = pd.read_csv("aita_clean.csv")
Hi, I finally fixed it. I had the file, but I needed to change my pd.read_csv parameters to make it work:
df = pd.read_csv('gdrive/My Drive//aita_clean.csv', error_bad_lines=False, nrows=10, encoding = "ISO-8859-1", engine='python')
I’ll try some analysis on the dataset and get back to you with my results
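A note for anyone reading this on a newer pandas: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, and on_bad_lines="skip" is the replacement. Here is a minimal sketch of the same idea. The small in-memory CSV is made up for illustration and only stands in for aita_clean.csv; for the real file you would pass the path, and the encoding and nrows arguments from the post above if you need them.

```python
import io

import pandas as pd

# Made-up sample data (not from the real dataset): the row with id 3
# has one field too many, which is the kind of line the parser rejects.
raw = "id,title\n1,first\n2,second\n3,third,extra\n4,fourth\n"

# on_bad_lines="skip" drops malformed rows instead of raising an error.
# It replaces error_bad_lines=False and requires pandas 1.3 or later.
df = pd.read_csv(
    io.StringIO(raw),
    on_bad_lines="skip",
    engine="python",
)

print(df)
```

Skipping bad lines silently discards data, so it is worth comparing the row count against what you expect before running any analysis.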
Cool, please be in touch with any results. I always like hearing about them.
Also, you might be interested in the DVC Python API: you can use dvc.api.open to load the file from DVC storage directly into your Python environment. See https://dvc.org/doc/api-reference/open
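Roughly, that could look like the sketch below. It assumes dvc and pandas are installed and that the repo URL and file name are the same ones used in the dvc get command earlier in this thread. The helper name load_aita and its nrows parameter are just illustrative, not part of DVC.

```python
import pandas as pd

# Same repo and file as in the `dvc get` command earlier in the thread.
REPO_URL = "https://github.com/iterative/aita_dataset"
DATA_PATH = "aita_clean.csv"


def load_aita(nrows=None):
    """Stream the DVC-tracked CSV into a DataFrame (hypothetical helper)."""
    # Imported here so the module still loads where dvc is not installed.
    import dvc.api

    # dvc.api.open returns a file-like object usable as a context manager;
    # the data is read from DVC remote storage and needs network access.
    with dvc.api.open(DATA_PATH, repo=REPO_URL) as f:
        return pd.read_csv(f, nrows=nrows)
```

Calling load_aita(nrows=10) would then read just the first ten rows, which is a quick way to check that the parsing options work before loading everything.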