Pandas find common NA records across multiple large dataframes


I have 3 dataframes, as shown below:

ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)

ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)

ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)

Now I want to identify the IDs where col1 is NA in all 3 input dataframes.

So, I tried the below

L1 = df1[df1['col1'].isna()]['ID'].tolist()
L2 = df2[df2['col1'].isna()]['ID'].tolist()
L3 = df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1, L2, L3])))
final_df = pd.concat([df1, df2, df3], ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]

While the above works, is there a more efficient and elegant approach to doing this?

As you can see, I am repeating the same statement three times (once per dataframe).

In my real data, however, I have 12 dataframes, and I have to find the IDs where col1 is NA in all 12 of them.
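For reference, the intersection step itself extends to any number of dataframes; a minimal sketch using functools.reduce instead of three hand-written statements:

from functools import reduce

dfs = [df1, df2, df3]  # in the real data, all 12 dataframes

# One set of NA IDs per dataframe, intersected across all of them
na_id_sets = [set(df.loc[df['col1'].isna(), 'ID']) for df in dfs]
common_ids_all = list(reduce(set.intersection, na_id_sets))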

Update: my current read operation looks like the below.

import pandas as pd

fnames = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']
dfs = []
NA_list = []

def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]   # 8th row holds the real header
    df = df.iloc[8:, :]       # keep only the rows below the header
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)

for fname in fnames:
    preprocessing(fname)
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = list(set.intersection(*map(set, [L1, L2, L3])))
final_df[final_df['ID'].isin(final_list)]
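The same pipeline can also be written without mutating global lists by having preprocessing return the dataframe; a sketch under the same header-row assumption as above:

def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name='Sheet1')
    df.columns = df.iloc[7]  # 8th row holds the real header
    return df.iloc[8:, :]    # data starts below the header row

dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)

# NA-ID intersection as in the sketch above
na_id_sets = [set(df.loc[df['col1'].isna(), 'ID']) for df in dfs]
final_df[final_df['ID'].isin(set.intersection(*na_id_sets))]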

CodePudding user response:

You can use:

dfs = [df1, df2, df3]

# Stack the dataframes, keep only the rows where col1 is NA,
# then keep the IDs that occur once per input dataframe
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)

# Output
   ID col1  col2
3   4  NaN    34
3   4  NaN   304
3   4  NaN   314
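Note that, depending on the pandas version, the .isna() call inside query may need engine='python'; an equivalent boolean mask sidesteps the question entirely:

final_df = pd.concat(dfs)
final_df = final_df[final_df['col1'].isna()]  # same filter as the query above
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]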

Full code:

import pandas as pd

fnames = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']

def preprocessing(fname):
    # skiprows=8 reproduces the question's df.iloc[7] header /
    # df.iloc[8:] data split in a single read
    return pd.read_excel(fname, sheet_name='Sheet1', skiprows=8)

dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]

CodePudding user response:

This is where defining a function gets you sorted. If the dataframe list will keep changing, I would wrap the logic in a function. If I understood you correctly, the following will do:

def CombinedNaNs(lst):
    # Collect the NA rows of every dataframe in the list
    newdflist = [d[d['col1'].isna()] for d in lst]
    s = pd.concat(newdflist)
    # Keep the IDs whose NA rows show up in more than one dataframe
    return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()

dflist = [df1, df2, df3]  # list of dfs

CombinedNaNs(dflist)  # apply the function



    ID col1  col2
3   4  NaN    34
3   4  NaN   304
3   4  NaN   314
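One caveat as a follow-up: duplicated(subset=['ID'], keep=False) keeps any ID whose NA rows appear in at least two of the dataframes, not necessarily in all of them. If "NA in every dataframe" is a strict requirement, counting group sizes makes that explicit; a sketch (with a hypothetical helper name) assuming IDs are unique within each dataframe:

def CombinedNaNsStrict(lst):
    # NA rows from every dataframe, then require one hit per dataframe
    s = pd.concat(d[d['col1'].isna()] for d in lst)
    return s[s.groupby('ID')['ID'].transform('size') == len(lst)]

CombinedNaNsStrict(dflist)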