I have multiple large CSV files with a few column differences between them. To feed them into AWS QuickSight for data visualization, I want to unify the structure of these CSV files. I think there are two ways to do this:
- Add the missing columns to each CSV file so all of them look the same
- Combine all the CSV files into one large file
What is the best tool for doing this?
Is there any tool that can show the structural difference between two CSV files? If I can find out which columns are missing, I can also add them manually.
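One way to see the structural difference with tools you may already have is to compare just the headers in pandas; a minimal sketch, with placeholder file names:
import pandas as pd

# Read only the header row of each file; nrows=0 keeps this cheap.
cols_a = set(pd.read_csv('first.csv', nrows=0).columns)
cols_b = set(pd.read_csv('second.csv', nrows=0).columns)

# Columns present in one file but missing from the other.
print('only in first.csv: ', cols_a - cols_b)
print('only in second.csv:', cols_b - cols_a)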
With pandas I can combine the CSV files, but in the only way I know, I have to name all the columns explicitly (code below), and this is not practical.
import pandas as pd

df1 = pd.DataFrame({'column1': [1, 2],
                    'column2': [3, 4]})
df2 = pd.DataFrame({'column1': [5, 6],
                    'column3': [7, 8]})
pd.concat([df1, df2], ignore_index=True)
Result:
column1 column2 column3
0 1 3.0 NaN
1 2 4.0 NaN
2 5 NaN 7.0
3 6 NaN 8.0
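For what it's worth, the column names do not have to be typed by hand: pd.read_csv takes them from each file's header, and pd.concat aligns them exactly as in the toy example above. A minimal sketch, assuming all the CSV files sit in the current directory:
import glob
import pandas as pd

# Load every CSV; the columns come from each file's header row, not from code.
frames = [pd.read_csv(path) for path in sorted(glob.glob('*.csv'))]

# Union of all columns; cells for missing columns become NaN, as above.
combined = pd.concat(frames, ignore_index=True, sort=False)

# Write one unified file for QuickSight; NaN is written as an empty cell.
# (Re-running would also pick up combined.csv, so adjust the pattern if needed.)
combined.to_csv('combined.csv', index=False)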
Answer:
I cannot tell you what "the best" tool is; that's subjective and depends on many factors.
I can tell you that Miller (mlr) should probably be on your short list of tools to consider for working with CSV data. Also see the Miller GitHub site. One last thing: the author is super-helpful.
I have it on good authority that the following will do the job:
mlr --csv reshape -r "^A" -o item,value \
    then reshape -s item,value \
    then unsparsify --fill-with "" *.csv > result.csv
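A note on what that pipeline does: unsparsify is the part that matters most for this question. It fills in any field a record is missing, here with an empty string via --fill-with "", so every row of result.csv ends up with the same set of columns. That is essentially the first option from the question, applied across all the files at once. The two reshape steps depend on the field names in the original data (the -r "^A" pattern matches fields whose names start with "A") and may not be needed for other files.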