Upsert a csv file from a second file using bash

Time:02-15

I have a main csv file with records (file1). I then have a second "delta" csv file (file2). I would like to update the main file with the records from the delta file using bash. Existing records should get the new value (replace the row) and new records should be appended.

Example file1

unique_id|value  
'1'|'old'  
'2'|'old'  
'3'|'old'

Example file2

unique_id|value  
'1'|'new'  
'4'|'new'

Desired outcome

unique_id|value  
'1'|'new'  
'2'|'old'  
'3'|'old'  
'4'|'new'

CodePudding user response:

I immediately thought of join, but you cannot specify "take this column if there's a match, otherwise use another column, and have either output end up in a single column".

For command-line processing of CSV files, I really like GoCSV. It has its own CSV-aware join command—which is also limited like join (above)—and it has other commands that we can chain together to produce the desired output.

GoCSV uses a streaming/buffered reader/writer for as many subcommands as it can. Every command except join works in this buffered-in-buffered-out fashion; join has to read both inputs in full before it can match rows. Still, GoCSV is compiled and just really, really fast.

All GoCSV commands read the delimiter to use from the GOCSV_DELIMITER environment variable, so your first order of business is to export that for your pipe delimiter:

export GOCSV_DELIMITER='|'

Joining is easy: just specify the column from each file to use as the key. I'm also going to rename the columns now so that we're set up for the conditional logic in the next step. If your columns vary from file to file, you'll want to rename each set of columns first, before you join.

I'm telling gocsv join to use the first column of each file as the key (-c 1,1) and to do an outer join (-outer), which keeps rows from both the left and right sides regardless of match:

gocsv join -c 1,1 -outer file1.csv file2.csv \
| gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right'

| id_left | val_left | id_right | val_right |
|---------|----------|----------|-----------|
| 1       | old      | 1        | new       |
| 2       | old      |          |           |
| 3       | old      |          |           |
|         |          | 4        | new       |

There's no way to change a value in an existing column based on another column's value, but we can add new columns and use a templating language to define the logic we need.

The following syntax creates two new columns, id_final and val_final. For each column, if the right-hand field (id_right or val_right) has a value, that value is used; otherwise the left-hand field is used. This, combined with the "left then right" outer join from before, has the effect of the right side updating/overwriting the left side wherever the IDs matched:

... \
| gocsv add -name 'id_final'  -t '{{ if .id_right  }}{{ .id_right  }}{{ else }}{{ .id_left  }}{{ end }}' \
| gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}'

| id_left | val_left | id_right | val_right | id_final | val_final |
|---------|----------|----------|-----------|----------|-----------|
| 1       | old      | 1        | new       | 1        | new       |
| 2       | old      |          |           | 2        | old       |
| 3       | old      |          |           | 3        | old       |
|         |          | 4        | new       | 4        | new       |

Finally, we can select just the "final" fields and rename them back to their original names:

... \
| gocsv select -c 'id_final','val_final' \
| gocsv rename -c 1,2 -names 'unique_id','value'

| unique_id | value |
|-----------|-------|
| 1         | new   |
| 2         | old   |
| 3         | old   |
| 4         | new   |

GoCSV has pre-built binaries for modern platforms.
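
Where installing GoCSV is not an option, the same "outer join, then prefer the right side" idea can be sketched with the standard join, sort, and awk utilities. This is a swapped-in technique, not something GoCSV does internally. The file names and sample rows come from the question; the intermediate names main.sorted and delta.sorted are made up for illustration. It assumes the key is the first column, that no field contains the delimiter, and that an empty value in the delta means "no update".

```shell
#!/bin/sh
# Sketch: full outer join with join(1), then "right value, else left value" in awk.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Sample data from the question (illustrative).
printf '%s\n' "unique_id|value" "'1'|'old'" "'2'|'old'" "'3'|'old'" > file1.csv
printf '%s\n' "unique_id|value" "'1'|'new'" "'4'|'new'" > file2.csv

# join needs both inputs sorted on the key, with the header rows removed.
tail -n +2 file1.csv | LC_ALL=C sort -t '|' -k 1,1 > main.sorted
tail -n +2 file2.csv | LC_ALL=C sort -t '|' -k 1,1 > delta.sorted

# -a 1 -a 2 keeps unmatched rows from both sides (a full outer join).
# -o 0,1.2,2.2 prints: key, left value, right value (empty when missing).
# awk then picks the right value when it is present, else the left one.
upserted=$(
  head -n 1 file1.csv
  LC_ALL=C join -t '|' -a 1 -a 2 -o 0,1.2,2.2 main.sorted delta.sorted |
    awk -F '|' -v OFS='|' '{ print $1, ($3 != "" ? $3 : $2) }'
)
printf '%s\n' "$upserted"
```

Because of the sort, rows come out in byte order of the key rather than in their original order, and the -o list has to be extended by hand for files with more than one value column.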

CodePudding user response:

I use Miller and run the following, where 01.csv is the main file and 02.csv is the delta file:

mlr --csv --fs "|" join --ul --ur -j unique_id --lp "l@" --rp "r@"  -f 01.csv \
then put 'if(is_not_null(${r@value})){$value=${r@value}}else{$value=$value}' \
then cut -x -r -f '@' 02.csv

and I have

unique_id|value
'1'|'new'
'4'|'new'
'2'|'old'
'3'|'old'

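I run a full join (--ul and --ur keep the unpaired rows from both files) and use an if condition to check whether there is a value on the right. If there is, I use it; otherwise the existing value is kept. Note that the rows come out in a different order than in the desired outcome (updated and new records first, then the unchanged ones), so sort the result afterwards if the order matters.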
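
For comparison, here is a minimal sketch of the same upsert in plain awk. It keeps the main file's row order and appends new records at the end, as in the desired outcome. It is not part of the Miller answer. The helper name upsert is hypothetical, and the sketch assumes the key is the first column, both files have a header row, and no field contains the delimiter.

```shell
#!/bin/sh
# Sketch: upsert MAIN from DELTA with awk, keeping the order of MAIN.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Sample data from the question (illustrative).
printf '%s\n' "unique_id|value" "'1'|'old'" "'2'|'old'" "'3'|'old'" > file1.csv
printf '%s\n' "unique_id|value" "'1'|'new'" "'4'|'new'" > file2.csv

# upsert MAIN DELTA  (hypothetical helper; awk reads DELTA first, then MAIN)
upsert() {
  awk -F '|' '
    # First file (the delta): remember each row by key, in first-seen order.
    FNR == NR {
      if (FNR > 1) {
        if (!($1 in delta)) order[++n] = $1
        delta[$1] = $0
      }
      next
    }
    # Second file (the main file): pass the header through unchanged.
    FNR == 1 { print; next }
    # Existing key: print the delta row instead of the old one.
    ($1 in delta) { print delta[$1]; used[$1] = 1; next }
    # Key not in the delta: keep the row as it is.
    { print }
    # Append delta rows whose key never appeared in the main file.
    END {
      for (i = 1; i <= n; i++)
        if (!(order[i] in used)) print delta[order[i]]
    }
  ' "$2" "$1"
}

merged=$(upsert file1.csv file2.csv)
printf '%s\n' "$merged"
```

The delta is held in memory while the main file is streamed, which suits the usual case of a small delta against a large main file. If the delta repeats a key, the last row for that key wins.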
