Pandas groupby calculation using values from different rows based on other column


I have the following dataframe, in which observations are grouped in pairs. NaN here represents a different product traded against A within the pair. I want to group by transaction and compute A/other, so that every non-A value is expressed in units of A.

transaction name value ...many other columns
1           A    3
1           NaN  5 
2           NaN  7
2           A    6
3           A    4
3           NaN  3
4           A    10
4           NaN  9
5           C    8
5           A    6
..

Thus the desired df would be

transaction name value new_column  ...many other columns
1           A    3       NaN
1           NaN  5       0.6
2           NaN  7       0.8571
2           A    6       NaN
3           A    4       NaN
3           NaN  3       1.333
4           A    10      NaN
4           NaN  9       1.111
5           C    8       0.75
5           A    6       NaN
...

CodePudding user response:

First filter the rows with A and set transaction as their index; the rows with a missing name can then be divided by the matching A value looked up per transaction with Series.map:

m = df['name'].ne('A')

s = df[~m].set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
   transaction name  value  new_column
0            1    A      3         NaN
1            1  NaN      5    0.600000
2            2  NaN      7    0.857143
3            2    A      6         NaN
4            3    A      4         NaN
5            3  NaN      3    1.333333
6            4    A     10         NaN
7            4  NaN      9    1.111111
8            5  NaN      8    0.750000
9            5    A      6         NaN

EDIT: If there are multiple A values per group, not only one, a possible solution is to remove the duplicates first:

print (df)
    transaction name  value
0             1    A      3
1             1    A      4
2             1  NaN      5
3             2  NaN      7
4             2    A      6
5             3    A      4
6             3  NaN      3
7             4    A     10
8             4  NaN      9
9             5    C      8
10            5    A      6

# s = df[~m].set_index('transaction')['value']
# df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
# print (df)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects


m = df['name'].ne('A')

print (df[~m].drop_duplicates(['transaction','name']))
    transaction name  value
0             1    A      3
4             2    A      6
5             3    A      4
7             4    A     10
10            5    A      6

s = df[~m].drop_duplicates(['transaction','name']).set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
    transaction name  value  new_column
0             1    A      3         NaN <- two A rows in group 1
1             1    A      4         NaN <- two A rows in group 1
2             1  NaN      5    0.600000
3             2  NaN      7    0.857143
4             2    A      6         NaN
5             3    A      4         NaN
6             3  NaN      3    1.333333
7             4    A     10         NaN
8             4  NaN      9    1.111111
9             5    C      8    0.750000
10            5    A      6         NaN
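If the duplicated A values per group should be averaged rather than deduplicated to the first one, a variant sketch (the averaging choice is an assumption, not part of the original answer):

```python
import pandas as pd
import numpy as np

# Duplicated-A sample from the EDIT, truncated to the first two groups
df = pd.DataFrame({
    'transaction': [1, 1, 1, 2, 2],
    'name': ['A', 'A', np.nan, np.nan, 'A'],
    'value': [3, 4, 5, 7, 6],
})

m = df['name'].ne('A')
# Average the duplicate A values per transaction instead of keeping the first;
# groupby().mean() yields a unique index, so map works again
s = df[~m].groupby('transaction')['value'].mean()
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print(df)
```

Here transaction 1 divides by (3 + 4) / 2 = 3.5 instead of 3.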

CodePudding user response:

Assuming there are exactly two rows per transaction, you can sort so the A row comes first in each group, take the first and last value with agg, and divide them. The per-transaction result is then mapped back by transaction, so it aligns with the original row index:

m = df['name'].ne('A')
s = (df.sort_values('name')
       .groupby('transaction')['value']
       .agg(f='first', l='last')
       .agg(lambda x: x['f'] / x['l'], axis=1))
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s)
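As a self-contained sketch of this first/last approach (sample data assumed from the question): sorting by name puts the A row first in each group, since both NaN and 'C' sort after 'A', so first/last is A divided by the other product. Mapping by transaction, rather than assigning the grouped Series directly, avoids index-alignment mismatches between the transaction-indexed result and the original row labels.

```python
import pandas as pd
import numpy as np

# Sample data assumed from the question (two rows per transaction)
df = pd.DataFrame({
    'transaction': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'name': ['A', np.nan, np.nan, 'A', 'A', np.nan, 'A', np.nan, 'C', 'A'],
    'value': [3, 5, 7, 6, 4, 3, 10, 9, 8, 6],
})

# Sorting by name puts the A row first in each group (NaN and 'C' sort after 'A')
s = (df.sort_values('name')
       .groupby('transaction')['value']
       .agg(['first', 'last'])
       .pipe(lambda d: d['first'] / d['last']))  # A value / other value

# Write the ratio onto the non-A rows only, aligned via the transaction key
m = df['name'].ne('A')
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s)
print(df)
```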