I need to create a dataframe one 100 csv file. My issues is that I have more than 100 CSVs with more than 55000 rows in each (as primary keys).
Now the difference between the csv files is that is all columns (maybe around 1200 columns) were broken into separate files. In other words, I need to do a FULL JOIN these csv files based on the PRIMARY row.
I got to the below code so far but need help to compelete it.
import os
import glob
import pandas as pd
os.chdir(r'C:\Users\user1\GCP_All')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv(r'C:\Users\user1\GCP_All\combined.csv', index=False, encoding='utf-8-sig')
similar example is this:
CSV_A:
PK TOP_SALES TOP_PURCH
111111 A B
222222 C D
333333 E F
CSV_B:
PK AVG_SALE AVG_PURCH
111111 G H
222222 I J
333333 K L
CSV_C:
PK MIN_SALE MIN_PURCH
111111 M N
222222 O P
333333 Q R
desired output:
PK TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH
111111 A B G H M N
222222 O P I J O P
333333 Q R K L Q R
Thanks in advance! :)
CodePudding user response:
Convert PK
column to index and then add axis=1
to concat
:
combined_csv = pd.concat([pd.read_csv(f, index_col=0) for f in all_filenames ], axis=1)
print (combined_csv)
TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH
PK
111111 A B G H M N
222222 C D I J O P
333333 E F K L Q R