I have a dataframe df1 with columns id1 and id2:
id1 id2
a1 b2
a6 b2
a3 c7
and another dataframe df2:
id f1 f2 f3 f4 f5
a1 0 1 0 0 1
a2 0 1 1 0 1
b2 1 1 0 0 1
a6 1 1 0 0 1
a3 1 0 0 0 0
c7 1 0 0 0 1
i want to add to df1 column names of df2 for those id's which are in id1 and id2 and have same value for the pair. So desired result is:
id1 id2 feature
a1 b2 f2,f5
a6 b2 f1,f2,f5
a3 c7 f1
CodePudding user response:
You can add a column to df2
which represents the feature it contains.
df2['f'] = df2.apply(lambda row: [c for c, v in zip(df2.columns[1:], row[1:]) if v], axis=1)
df2 = df2.set_index('id')
print(df2)
f1 f2 f3 f4 f5 f
id
a1 0 1 0 0 1 [f2, f5]
a2 0 1 1 0 1 [f2, f3, f5]
b2 1 1 0 0 1 [f1, f2, f5]
a6 1 1 0 0 1 [f1, f2, f5]
a3 1 0 0 0 0 [f1]
c7 1 0 0 0 1 [f1, f5]
Then use an apply
on df1
to get the common part features:
df1['feature'] = df1.apply(lambda row: ','.join(set(df2.loc[row['id1'], 'f']) & set(df2.loc[row['id2'], 'f'])), axis=1)
id1 id2 feature
0 a1 b2 f5,f2
1 a6 b2 f1,f5,f2
2 a3 c7 f1
CodePudding user response:
You could stack df2 and then perform a double merge on df1 to get the features for the relevant id combinations:
# get a stacked version of df2:
df2_stacked = (df2.set_index("id")
.replace({0: np.nan})
.rename_axis(columns="feature")
.stack()
.rename("val")
.reset_index())
# produces all relevant combinations of id and feature:
id feature val
0 a1 f2 1.0
1 a1 f5 1.0
2 a2 f2 1.0
3 a2 f3 1.0
...
# Then merge this table with df1 twice, first on id1,
# then on (id2, feature), thereby giving a stacked table
# with all relevant combinations of id1 and id2 with their
# corresponding equal feature. Then groupby and unique:
pd.merge(df2_stacked,
pd.merge(df1, df2_stacked, left_on="id1", right_on="id"),
left_on=["id", "feature"],
right_on=["id2", "feature"]
).groupby(["id1", "id2"]).feature.unique().reset_index()
# out
id1 id2 feature
0 a1 b2 [f2, f5]
1 a3 c7 [f1]
2 a6 b2 [f1, f2, f5]
CodePudding user response:
for the provided data example this code works as well:
df2 = df2.replace({0:np.nan})
f = lambda y: set(df2.loc[df2.id==y,'f1':'f5'].dropna(1).columns)
df1['feature'] = df1.apply(lambda x: f(x.id1)&f(x.id2),1)
>>> df1
'''
id1 id2 feature
0 a1 b2 {f2, f5}
1 a6 b2 {f1, f2, f5}
2 a3 c7 {f1}