How to add columns in dataframe under my specific condition?-CodePudding

I have a dataframe df1 with columns id1 and id2:

id1     id2
a1      b2
a6      b2
a3      c7

and another dataframe df2:

id      f1      f2    f3     f4    f5
a1      0       1     0      0     1
a2      0       1     1      0     1
b2      1       1     0      0     1
a6      1       1     0      0     1
a3      1       0     0      0     0
c7      1       0     0      0     1

i want to add to df1 column names of df2 for those id's which are in id1 and id2 and have same value for the pair. So desired result is:

id1     id2    feature
a1      b2      f2,f5 
a6      b2      f1,f2,f5
a3      c7      f1

CodePudding user response：

You can add a column to df2 which represents the feature it contains.

df2['f'] = df2.apply(lambda row: [c for c, v in zip(df2.columns[1:], row[1:]) if v], axis=1)
df2 = df2.set_index('id')

print(df2)

    f1  f2  f3  f4  f5             f
id
a1   0   1   0   0   1      [f2, f5]
a2   0   1   1   0   1  [f2, f3, f5]
b2   1   1   0   0   1  [f1, f2, f5]
a6   1   1   0   0   1  [f1, f2, f5]
a3   1   0   0   0   0          [f1]
c7   1   0   0   0   1      [f1, f5]

Then use an apply on df1 to get the common part features:

df1['feature'] = df1.apply(lambda row: ','.join(set(df2.loc[row['id1'], 'f']) & set(df2.loc[row['id2'], 'f'])), axis=1)

  id1 id2   feature
0  a1  b2     f5,f2
1  a6  b2  f1,f5,f2
2  a3  c7        f1

CodePudding user response：

You could stack df2 and then perform a double merge on df1 to get the features for the relevant id combinations:

# get a stacked version of df2:
df2_stacked = (df2.set_index("id")
                  .replace({0: np.nan})
                  .rename_axis(columns="feature")
                  .stack()
                  .rename("val")
                  .reset_index())

# produces all relevant combinations of id and feature:
    id feature  val
0   a1      f2  1.0
1   a1      f5  1.0
2   a2      f2  1.0
3   a2      f3  1.0
...

# Then merge this table with df1 twice, first on id1, 
# then on (id2, feature), thereby giving a stacked table 
# with all relevant combinations of id1 and id2 with their 
# corresponding equal feature. Then groupby and unique:

pd.merge(df2_stacked,
         pd.merge(df1, df2_stacked, left_on="id1", right_on="id"),
         left_on=["id", "feature"],
         right_on=["id2", "feature"]
        ).groupby(["id1", "id2"]).feature.unique().reset_index()

# out
  id1 id2       feature
0  a1  b2      [f2, f5]
1  a3  c7          [f1]
2  a6  b2  [f1, f2, f5]

CodePudding user response：

for the provided data example this code works as well:

df2 = df2.replace({0:np.nan})

f = lambda y: set(df2.loc[df2.id==y,'f1':'f5'].dropna(1).columns)
df1['feature'] = df1.apply(lambda x: f(x.id1)&f(x.id2),1)

>>> df1
'''
  id1 id2       feature
0  a1  b2      {f2, f5}
1  a6  b2  {f1, f2, f5}
2  a3  c7          {f1}