My dataframe looks like this:
df = pd.DataFrame({"ID": [1, 2],
"fields": [["eggs", "apple", "toy", "orange", "bear", "red"],
["orange", "bear", "red"]],
"subfields":[["bear", "red"],["bear", "red"]]})
and my expected output is:
ID | fields | overlap between subfields and fields |
---|---|---|
1 | eggs | N |
1 | apple | N |
1 | toy | N |
1 | orange | N |
1 | bear | Y |
1 | red | Y |
2 | orange | N |
2 | bear | Y |
2 | red | Y |
is there anyway that I can do this? thanks a lot!
CodePudding user response:
Here is a very simple solution.
Disclaimer: If you have anything more than ~15 elements in each list, forget about my solution.
new_df = df.explode('fields')
new_df['overlap between subfields and fields'] = new_df.apply(
lambda row: row['fields'] in row['subfields'],
axis=1
).replace({True: 'Y', False: 'N'})
new_df.drop(columns='subfields', inplace=True)
CodePudding user response:
You can explode
and use groupby
:
s = df.set_index('ID')['subfields']
out = (df
.explode('fields')
.assign(**{'overlap': lambda d: np.where(d.groupby('ID', group_keys=False)
.apply(lambda g: g['fields'].isin(s[g.name]))
, 'Y', 'N'
)
})
.drop(columns='subfields') # comment to keep subfields
)
output:
ID fields overlap
0 1 eggs N
0 1 apple N
0 1 toy N
0 1 orange N
0 1 bear Y
0 1 red Y
1 2 orange N
1 2 bear Y
1 2 red Y
alternative:
df2 = pd.DataFrame([[i, set(f).intersection(s)] for i,f,s, in
zip(df['ID'], df['fields'], df['subfields'])],
columns=['ID', 'fields']
).explode('fields')
new_col = 'overlap'
out = (df.drop(columns='subfields')
.explode('fields')
.merge(df2.assign(**{new_col: 'Y'}), how='left')
.fillna({new_col: 'N'})
)
CodePudding user response:
new_df = df.explode('fields')
new_df
use apply for easy and cleaner way
def overlap(fields, subfields):
if fields in subfields:
return 'Y'
else:
return 'N'
new_df['overlap'] = new_df[['fields','subfields']].apply(lambda x: overlap(x.fields,x.subfields),axis=1)
del new_df['subfields']
new_df
CodePudding user response:
Actually you can use isin
that does the checking for each element.
With that its pretty easy to do this:
s = df.apply(lambda x: np.isin(x['fields'], x['subfields']), axis=1).explode()
s = np.where(s, 'Y', 'N')
df = df.explode('fields').assign(overlap=s).drop('subfields',axis=1)
print(df):
ID fields overlap
0 1 eggs N
0 1 apple N
0 1 toy N
0 1 orange N
0 1 bear Y
0 1 red Y
1 2 orange N
1 2 bear Y
1 2 red Y