I have three dataframes [Accepted, Ignored, Input]. We have to check the "Subset" column strings of Input df in the subset columns of accepted and ignored df. If the subset are not present in both of them, we have to keep them in exception. [Strings should match exactly]] For eg.
Accepted df is
Subset1 Subset2
AC ASD
ZX AGH
Ignored df is
Subset1 Subset2
AD ADS
Input df is
Name Subset
Apple AC,AD,AGH
Bat ZX,ADS,AXC
Cat HJ,AGH
Output we need:
Name Subset Accepted Ignored Exception
Apple AC,AD,AGH AC, AGH AD
Bat ZX,ADS,AXC ZX ADS AXC
Cat HJ,AGH AGH HJ
All the dataframes can have multiple rows and columns.
How can I do it using Python.
CodePudding user response:
Use sets for get intersection and difference in splitted values by ,
, last join back to strings by join
:
Accepted = set(df1.to_numpy().ravel())
Ignored = set(df2.to_numpy().ravel())
Both = set.union(Accepted,Ignored)
df['Accepted'] = df['Subset'].apply(lambda x: ','.join(Accepted & set(x.split(','))))
df['Ignored'] = df['Subset'].apply(lambda x: ','.join(Ignored & set(x.split(','))))
df['Exception'] = df['Subset'].apply(lambda x: ','.join(set(x.split(',')) - Both))
print (df)
Name Subset Accepted Ignored Exception
0 Apple AC,AD,AGH AC,AGH AD
1 Bat ZX,ADS,AXC ZX ADS AXC
2 Cat HJ,AGH AGH HJ
CodePudding user response:
Here's a way to do with vanilla lists.
accepted_list = [item for sublist in accepted_df.values.tolist() for item in sublist]
ignored_list = [item for sublist in ignored_df.values.tolist() for item in sublist]
input_df["Subset"] = input_df["Subset"].apply(lambda x: x.split(","))
input_df["Accepted"] = input_df["Subset"].apply(lambda x: [i for i in x if i in accepted_list])
input_df["Ignored"] = input_df["Subset"].apply(lambda x: [i for i in x if i in ignored_list])
input_df["Exception"] = input_df["Subset"].apply(lambda x: [i for i in x if i not in accepted_list ignored_list])
Basically checking if members in our list are/not in other list. If you want to keep in a string, you can join the members of the list by "'".join as jezrael suggested.
CodePudding user response:
Assuming accepted
, ignored
, and inp
the input dataframes, you can achieve this by reshaping your dataframe and mapping from a dictionary:
cat = (pd.concat({'Accepted': accepted, 'Ignored': ignored})
.stack()
.reset_index()
.set_index(0)['level_0'].to_dict()
)
# {'AC': 'Accepted', 'ASD': 'Accepted', 'ZX': 'Accepted',
# 'AGH': 'Accepted', 'AD': 'Ignored', 'ADS': 'Ignored'}
(inp.assign(col=inp['Subset'].str.split(','))
.explode('col')
.assign(value=lambda d: d['col'].map(lambda x: cat.get(x, 'Exception')))
.pivot_table(index=['Name', 'Subset'], columns='value', values='Subset2',
aggfunc=','.join, fill_value='')
.reset_index().rename_axis(None, axis=1)
)
output:
Name Subset Accepted Exception Ignored
0 Apple AC,AD,AGH AC,AGH AD
1 Bat ZX,ADS,AXC ZX AXC ADS
2 Cat HJ,AGH AGH HJ