How to merge data (CSV) files from multiple branches (Git and DVC)?

Time:02-19

Background: In my projects I use Git and DVC to track versions:

  • Git - source code only
  • DVC - datasets, model objects, and outputs

I test different approaches in separate branches, e.g.:

  • random_forest
  • neural_network_1
  • ...

I typically store the output predictions in a CSV file with a standardised name (e.g. pred_test.csv). As a consequence, each branch has its own version of pred_test.csv. The file structure is very simple; it contains two columns:

  • ID
  • Prediction

Question: What is the best way to merge those prediction files into a single file?

I would like to obtain a file with the following structure:

  • ID
  • Prediction_random_forest
  • Prediction_neural_network_1
  • Prediction_...

My main issue is how to access the prediction files that live in different branches.

CodePudding user response:

I would try to use dvc get in this case:

dvc get -o random_forest_pred.csv --rev random_forest . pred_test.csv

This should fetch pred_test.csv as it exists on the random_forest branch and save it as random_forest_pred.csv.

Note the . before pred_test.csv: it is required and means "use the current repo", since dvc get can also fetch from other repositories (e.g. a GitHub URL).
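
To repeat the command above for every branch, one option is a small Python wrapper. This is only a sketch: the BRANCHES list and the function names are hypothetical, and it assumes the dvc executable is on your PATH and that the script runs from the repository root.

```python
import subprocess

# Hypothetical branch list; replace with your own branch names.
BRANCHES = ["random_forest", "neural_network_1"]


def dvc_get_command(branch, path="pred_test.csv", repo="."):
    """Build the `dvc get` call that fetches `path` as of `branch`."""
    out = f"{branch}_pred.csv"
    return ["dvc", "get", "-o", out, "--rev", branch, repo, path]


def fetch_all(branches=BRANCHES):
    """Run one `dvc get` per branch and return the output file names."""
    outputs = []
    for branch in branches:
        cmd = dvc_get_command(branch)
        subprocess.run(cmd, check=True)  # raises if dvc exits non-zero
        outputs.append(cmd[3])  # the value passed to -o
    return outputs
```

Calling fetch_all() should leave one <branch>_pred.csv file per branch in the current directory.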

Then you could use a CLI tool or write a script to join the files:

https://unix.stackexchange.com/questions/293775/merging-contents-of-multiple-csv-files-into-single-csv-file
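
If you prefer a script, a join on ID can be sketched with Python's standard csv module. The function name merge_predictions and the outer-join behaviour (an empty cell when a branch lacks an ID) are my assumptions, not something the question specifies; the column names ID and Prediction come from the question.

```python
import csv


def merge_predictions(files, out_path, id_col="ID", pred_col="Prediction"):
    """Join per-branch prediction files on `id_col`.

    `files` maps a branch name to the CSV fetched for that branch.
    IDs missing from a branch get an empty cell (outer join).
    """
    merged = {}  # ID -> {branch: prediction}; dicts keep insertion order
    for branch, path in files.items():
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                merged.setdefault(row[id_col], {})[branch] = row[pred_col]

    header = [id_col] + [f"{pred_col}_{b}" for b in files]
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for row_id, preds in merged.items():
            writer.writerow([row_id] + [preds.get(b, "") for b in files])
    return header
```

For example, merge_predictions({"random_forest": "random_forest_pred.csv"}, "pred_all.csv") would write a file with the columns ID and Prediction_random_forest; pred_all.csv is just a placeholder name.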
