I have multiple text files with multiple tab delimited columns, from which I would like to extract rows with NaN values in column PTA
and add the filename as additional column to those extracted rows.
So for example:
File1
i B C D E F G H I J PTA K L
0 0.24055 0.31092 0.03447 0.00015 0.93464 0.08232 0.52609 0.00560 0.44018 0.06337 236 770
1 0.43976 0.45359 0.01220 0.93317 0.05711 0.06316 0.49310 0.05882 0.51825 0.18522 433 573
2 0.48067 0.17356 0.96903 0.02968 0.08864 0.05567 0.30423 0.02337 0.01981 0.56240 481 525
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 NaN 422 584
File2
i B C D E F G H I J PTA K L
0 0.1234 0.31092 0.356 0.00015 0.93464 0.08232 0.52609 0.5873 0.0034 0.06337 367 985
1 0.975 0.367 0.01220 0.875 0.05711 0.0365 0.49310 0.05882 0.51825 NaN 635 784
2 0.48067 0.17356 0.96903 0.02968 0.08864 0.05567 0.30423 0.02337 0.01981 0.823 956 213
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 1.30621 678 943
Expected output df:
i B C D E F G H I J PTA K L
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 NaN 422 584 File1
1 0.975 0.367 0.01220 0.875 0.05711 0.0365 0.49310 0.05882 0.51825 NaN 635 784 File2
This I would like to do over multiple files using python. At the moment I have tried this code, but I am not sure how to put it into a proper loop:
# import required module
import os
import pandas as pd
# assign directory
directory = 'files'
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
df=pd.read_csv(f, sep='\t',comment='#')
print(df)
rows=df[df['PTA'].isna()]
print(rows)
At the moment, I am missing the part, where to add those rows into the new data frame.
CodePudding user response:
While iterating your files add the new column with filename to your filtered data, append()
the new dataframe
to a list and pd.concat()
all dataframes from list:
...
na_data = []
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
df=pd.read_csv(f, sep='\t',comment='#')
rows=df[df['PTA'].isna()]
if not rows.empty():
rows['filename'] = filename
na_data.append(rows)
pd.concat(na_data)