Pandas dataframe create new column indicating overlapping values in other columns-CodePudding

My dataframe looks like this:

df = pd.DataFrame({"ID": [1, 2],
                   "fields": [["eggs", "apple", "toy", "orange", "bear", "red"], 
                              ["orange", "bear", "red"]],
                   "subfields":[["bear", "red"],["bear", "red"]]})

and my expected output is:

ID	fields	overlap between subfields and fields
1	eggs	N
1	apple	N
1	toy	N
1	orange	N
1	bear	Y
1	red	Y
2	orange	N
2	bear	Y
2	red	Y

is there anyway that I can do this? thanks a lot!

CodePudding user response：

Here is a very simple solution.

Disclaimer: If you have anything more than ~15 elements in each list, forget about my solution.

new_df = df.explode('fields')
new_df['overlap between subfields and fields'] = new_df.apply(
    lambda row: row['fields'] in row['subfields'], 
    axis=1
    ).replace({True: 'Y', False: 'N'})

new_df.drop(columns='subfields', inplace=True)

CodePudding user response：

You can explode and use groupby:

s = df.set_index('ID')['subfields']

out = (df
 .explode('fields')
 .assign(**{'overlap': lambda d: np.where(d.groupby('ID', group_keys=False)
                                           .apply(lambda g: g['fields'].isin(s[g.name]))
                                          , 'Y', 'N'
                                         )
           })
 .drop(columns='subfields') # comment to keep subfields
)

output:

   ID  fields overlap
0   1    eggs       N
0   1   apple       N
0   1     toy       N
0   1  orange       N
0   1    bear       Y
0   1     red       Y
1   2  orange       N
1   2    bear       Y
1   2     red       Y

alternative:

df2 = pd.DataFrame([[i, set(f).intersection(s)] for i,f,s, in
                    zip(df['ID'], df['fields'], df['subfields'])],
                    columns=['ID', 'fields']
                   ).explode('fields')

new_col = 'overlap'
out = (df.drop(columns='subfields')
         .explode('fields')
         .merge(df2.assign(**{new_col: 'Y'}), how='left')
         .fillna({new_col: 'N'})
      )

CodePudding user response：

new_df = df.explode('fields')
new_df

use apply for easy and cleaner way

def overlap(fields, subfields):
    if fields in subfields:
        return 'Y'
    else:
        return 'N'


new_df['overlap'] = new_df[['fields','subfields']].apply(lambda x:  overlap(x.fields,x.subfields),axis=1)
del new_df['subfields']
new_df

CodePudding user response：

Actually you can use isin that does the checking for each element.

With that its pretty easy to do this:

s = df.apply(lambda x: np.isin(x['fields'], x['subfields']), axis=1).explode()
s = np.where(s, 'Y', 'N')
df = df.explode('fields').assign(overlap=s).drop('subfields',axis=1)

print(df):

   ID  fields overlap
0   1    eggs       N
0   1   apple       N
0   1     toy       N
0   1  orange       N
0   1    bear       Y
0   1     red       Y
1   2  orange       N
1   2    bear       Y
1   2     red       Y