I have 3 dataframes like as shown below
ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)
Now I want to identify the IDs
where col1
is NA
in all 3 input dataframes.
So, I tried the below
L1=df1[df1['col1'].isna()]['ID'].tolist()
L2=df2[df2['col1'].isna()]['ID'].tolist()
L3=df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1,L2,L3])))
final_df = pd.concat([df1,df2,df3],ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]
While the above works, is there any efficient and elegant approach do the above?
As you can see that am repeating the same statement thrice (for 3 dataframes)
However, in my real data, I have 12 dataframes where I have to get IDs where col1
is NA in all 12 dataframes.
update - my current read operation looks like below
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
dfs=[]
NA_list=[]
def preprocessing(fname):
df= pd.read_excel(fname, sheet_name="Sheet1")
df.columns = df.iloc[7]
df = df.iloc[8: , :]
NA_list.append(df[df['col1'].isna()]['ID'])
dfs.append(df)
[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = (list(set.intersection(*map(set, [L1,L2,L3]))))
final_df[final_df['ID'].isin(final_list)]
CodePudding user response:
You can use:
dfs = [df1, df2, df3]
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)
# Output
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
Full code:
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
def preprocessing(fname):
return pd.read_excel(fname, sheet_name='Sheet1', skiprows=6)
dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
CodePudding user response:
This are times when def function get you sorted. If the dataframe list will continually change I will create a def function. If I got you right the following will do;
def CombinedNaNs(lst):
newdflist =[]
for d in dflist:
newdflist.append(d[d['col1'].isna()])
s=pd.concat(newdflist)
return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()
dflist=[df1,df2,df3]#List of dfs
CombinedNaNs(dflist)#apply function
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314