Looping over text files and get rows with NaN values in column in python-CodePudding

I have multiple text files with multiple tab delimited columns, from which I would like to extract rows with NaN values in column PTA and add the filename as additional column to those extracted rows.

So for example:

File1

             i       B       C  D  E  F  G       H       I       J      PTA  K  L
0  0.24055  0.31092   0.03447   0.00015   0.93464   0.08232  0.52609  0.00560  0.44018  0.06337   236   770
1  0.43976  0.45359   0.01220   0.93317   0.05711   0.06316  0.49310  0.05882  0.51825  0.18522   433   573
2  0.48067  0.17356   0.96903   0.02968   0.08864   0.05567  0.30423  0.02337  0.01981  0.56240   481   525
3  0.41872  0.18580   0.00191   0.08048   0.90871   0.02035  0.23598  0.01610  0.19815  NaN   422   584

File2

             i       B       C  D  E  F  G       H       I       J      PTA  K  L
0  0.1234  0.31092   0.356   0.00015   0.93464   0.08232  0.52609  0.5873  0.0034  0.06337   367   985
1  0.975  0.367   0.01220   0.875   0.05711   0.0365  0.49310  0.05882  0.51825  NaN   635   784
2  0.48067  0.17356   0.96903   0.02968   0.08864   0.05567  0.30423  0.02337  0.01981  0.823   956   213
3  0.41872  0.18580   0.00191   0.08048   0.90871   0.02035  0.23598  0.01610  0.19815  1.30621   678   943

Expected output df:

      i       B       C  D  E  F  G       H       I       J      PTA  K  L
3  0.41872  0.18580   0.00191   0.08048   0.90871   0.02035  0.23598  0.01610  0.19815  NaN   422   584   File1
1  0.975  0.367   0.01220   0.875   0.05711   0.0365  0.49310  0.05882  0.51825  NaN   635   784   File2

This I would like to do over multiple files using python. At the moment I have tried this code, but I am not sure how to put it into a proper loop:

# import required module
import os
import pandas as pd
# assign directory
directory = 'files'
 
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    df=pd.read_csv(f, sep='\t',comment='#')
    print(df)
    rows=df[df['PTA'].isna()]
    print(rows)

At the moment, I am missing the part, where to add those rows into the new data frame.

CodePudding user response：

While iterating your files add the new column with filename to your filtered data, append() the new dataframe to a list and pd.concat() all dataframes from list:

...
na_data = []

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    df=pd.read_csv(f, sep='\t',comment='#')
    
    rows=df[df['PTA'].isna()]
    if not rows.empty():
        rows['filename'] = filename
        na_data.append(rows)

pd.concat(na_data)