Checking different elements of a cell in dataframe separated by comma using Pandas-CodePudding

I have three dataframes [Accepted, Ignored, Input]. We have to check the "Subset" column strings of Input df in the subset columns of accepted and ignored df. If the subset are not present in both of them, we have to keep them in exception. [Strings should match exactly]] For eg.

Accepted df is

Subset1    Subset2  
AC         ASD
ZX         AGH

Ignored df is

Subset1    Subset2
AD         ADS

Input df is

Name    Subset
Apple   AC,AD,AGH
Bat     ZX,ADS,AXC
Cat     HJ,AGH

Output we need:

 Name    Subset        Accepted    Ignored     Exception
 Apple   AC,AD,AGH     AC, AGH     AD
 Bat     ZX,ADS,AXC    ZX          ADS         AXC
 Cat     HJ,AGH        AGH                     HJ

All the dataframes can have multiple rows and columns.

How can I do it using Python.

CodePudding user response：

Use sets for get intersection and difference in splitted values by ,, last join back to strings by join:

Accepted = set(df1.to_numpy().ravel())
Ignored = set(df2.to_numpy().ravel())
Both = set.union(Accepted,Ignored)

df['Accepted'] = df['Subset'].apply(lambda x: ','.join(Accepted & set(x.split(','))))
df['Ignored'] = df['Subset'].apply(lambda x:  ','.join(Ignored & set(x.split(','))))
df['Exception'] = df['Subset'].apply(lambda x:  ','.join(set(x.split(',')) - Both))

print (df)
    Name      Subset Accepted Ignored Exception
0  Apple   AC,AD,AGH   AC,AGH      AD          
1    Bat  ZX,ADS,AXC       ZX     ADS       AXC
2    Cat      HJ,AGH      AGH                HJ

CodePudding user response：

Here's a way to do with vanilla lists.

accepted_list = [item for sublist in accepted_df.values.tolist() for item in sublist]
ignored_list = [item for sublist in ignored_df.values.tolist() for item in sublist]

input_df["Subset"] = input_df["Subset"].apply(lambda x: x.split(","))
input_df["Accepted"] = input_df["Subset"].apply(lambda x: [i for i in x if i in accepted_list])
input_df["Ignored"] = input_df["Subset"].apply(lambda x: [i for i in x if i in ignored_list])
input_df["Exception"] = input_df["Subset"].apply(lambda x: [i for i in x if i not in accepted_list   ignored_list])

Basically checking if members in our list are/not in other list. If you want to keep in a string, you can join the members of the list by "'".join as jezrael suggested.

CodePudding user response：

Assuming accepted, ignored, and inp the input dataframes, you can achieve this by reshaping your dataframe and mapping from a dictionary:

cat = (pd.concat({'Accepted': accepted, 'Ignored': ignored})
       .stack()
       .reset_index()
       .set_index(0)['level_0'].to_dict()
    )
# {'AC': 'Accepted', 'ASD': 'Accepted', 'ZX': 'Accepted',
#  'AGH': 'Accepted', 'AD': 'Ignored', 'ADS': 'Ignored'}

(inp.assign(col=inp['Subset'].str.split(','))
    .explode('col')
    .assign(value=lambda d: d['col'].map(lambda x: cat.get(x, 'Exception')))
    .pivot_table(index=['Name', 'Subset'], columns='value', values='Subset2',
                 aggfunc=','.join, fill_value='')
    .reset_index().rename_axis(None, axis=1)
)

output:

    Name      Subset Accepted Exception Ignored
0  Apple   AC,AD,AGH   AC,AGH                AD
1    Bat  ZX,ADS,AXC       ZX       AXC     ADS
2    Cat      HJ,AGH      AGH        HJ