Home > Enterprise >  Get values from specific columns in different csv files, concat. and create column based on filename
Get values from specific columns in different csv files, concat. and create column based on filename

Time:12-17

I have multiple (more than 100) files like this:

filename: 00.csv

residue, vwd, total  
AAA,0.00, -9.45  
BBB, 0.45, -1.45  
CCC, 0.44, -3    
DDD, 0.1, -10 

filename: 01.csv

residue, vwd, total  
AAA, 2, -0.56  
BBB, -4, -9.32  
CCC, 2.54, -10  
DDD, 3, -6.4  

...

I would like to create a matrix in a new csv file where the first column is "residue", and the others are based on filename (without extension). Below filename, it should be values from "total" column. It would be like this:

residue, 00, 01, ...      
AAA, -9.45, -0,56, ...  
BBB, -1.45, -9.32, ...  
CCC, -3, -10,...  
DDD,  -10, -6.4, ... 

. . .

Thanks in advance!

CodePudding user response:

This will work:

files = ['00.csv', '01.csv']

dfs = []
for file in files:
    df = pd.read_csv(file)
    df.columns = df.columns.str.strip()
    df = df[['residue', 'total']].rename({'total': os.path.splitext(file)[0]}, axis=1)
    dfs.append(df)

df = cols[0]
for sub_df in cols[1:]:
    df = df.merge(sub_df, on='residue')

Output:

>>> df
  residue     00     01
0     AAA  -9.45  -0.56
1     BBB  -1.45  -9.32
2     CCC  -3.00 -10.00
3     DDD -10.00  -6.40

CodePudding user response:

Lets try read the files then pivot table

first get all csv files in that dir

import os

#current directory
path = "."
file_names = list(filter(lambda x:x.endswith(".csv"),os.listdir(path)))

print(file_names)
['filename: 00.csv', 'filename: 01.csv']
df = pd.DataFrame()

for file in file_names:
   df = df.append(pd.concat([pd.read_csv(file),pd.DataFrame([file]*4,
   columns= ["filename"])],axis=1))

# to extract digit from filename
df["filename"] =df["filename"].str.extract('(\d )')

print(df)
  residue vwd  total  filename
0 AAA    0.00  -9.45  00
1 BBB    0.45  -1.45  00
2 CCC    0.44  -3.00  00
3 DDD    0.10  -10.00 00
0 AAA    2.00  -0.56  01
1 BBB   -4.00  -9.32  01
2 CCC    2.54  -10.00 01
3 DDD    3.00  -6.40  01

Then pivot_table will do the job

df.pivot_table(values='total', index='residue', columns='filename')

Final output:

filename  00   01
residue       
AAA          -9.45 -0.56
BBB          -1.45 -9.32
CCC          -3.00 -10.00
DDD          -10.00 -6.40
  • Related