only keep the value of a column in first match after merging, for rest it can be 0.0-CodePudding

I have two pandas dataframe, one has columns a, b, c & other has column a, b, d.

df1

id  month   c
1   1   TE
2   1   TE
1   1   NTE
2   1   NTE

df2

id  month   price
1   1   4
2   1   6

I want to merge these dataframes on id & month columns combinedly. so i did following -

df1.merge(df2, how='left', left_on=['id', 'month'], right_on=['id', 'month'])

results of above code is as expected. now, what i want is, after merge, only one pair of id & month should have price, other pair have price as 0

so the result should be like

id  month   c   price
1   1   TE  4
2   1   TE  6
1   1   NTE 0
2   1   NTE 0

Above can be done, while calling a check one each row, which i think is very expensive, n*n complexity.

Any leads with less expense is most welcome.

CodePudding user response：

df.price *= ~df.groupby(["id", "month"]).cumcount().astype(bool)

i use .cumcount() as a "is this first in the group" mask:

In [89]: df
Out[89]:
   id  month    c  price
0   1      1   TE      4
1   2      1   TE      6
2   1      1  NTE      4
3   2      1  NTE      6

In [90]: df.groupby(["id", "month"]).cumcount()
Out[90]:
0    0
1    0
2    1
3    1
dtype: int64

In [91]: ~_.astype(bool)
Out[91]:
0     True
1     True
2    False
3    False
dtype: bool

In [92]: df.price *= _

In [93]: df
Out[93]:
   id  month    c  price
0   1      1   TE      4
1   2      1   TE      6
2   1      1  NTE      0
3   2      1  NTE      0

CodePudding user response：

merged = df1.merge(df2, how='left', on=['id', 'month'])  

def f(x):
    x.iloc[1:] = 0
    return x

merged['price'] = merged.groupby(['id', 'month'])['price'].transform(f)

Note: if they are the same, you don't have to specify left_on and right_on separately.

CodePudding user response：

You can use a cumcount as grouper and fill the rest:

cols = ['id', 'month']
(df1.assign(rank=df1.groupby(cols).cumcount())
    .merge(df2.assign(rank=0), how='left',
           on=cols ['rank'])
    .fillna({'price': 0}, downcast='infer')
    .drop(columns='rank')
)

output:

   id  month    c  price
0   1      1   TE      4
1   2      1   TE      6
2   1      1  NTE      0
3   2      1  NTE      0

different column names

cols_left = ['id', 'month']
cols_right = ['id', 'month']
(df1.assign(rank=df1.groupby(cols_left).cumcount())
    .merge(df2.assign(rank=0), how='left',
           left_on=cols_left ['rank'],
           right_on=cols_right ['rank'])
    .fillna({'price': 0}, downcast='infer')
    .drop(columns='rank')
)