Get value from grouped data frame maximum in another column

Time:12-06

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.

I'm trying to return and assign (ideally to a new column) the value of column c taken from the row with the maximum value in column b (which holds dates). The data frame is grouped by column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
                   'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
                   'c': [np.nan,6,8,2,1,np.nan,6,np.nan,7]})
df['b'] = pd.to_datetime(df['b'])  # parse the date strings into datetime64 values
df

    a   b   c
0   a   2008-11-01  NaN
1   a   2022-07-01  6.0
2   a   2017-02-01  8.0
3   b   2017-02-01  2.0
4   b   2018-02-01  1.0
5   b   2008-11-01  NaN
6   c   2014-11-01  6.0
7   c   2008-11-01  NaN
8   c   2022-07-01  7.0

I want the following result:

    a   b   c   d
0   a   2008-11-01  NaN 8.0
1   a   2022-07-01  6.0 8.0
2   a   2017-02-01  8.0 8.0
3   b   2017-02-01  2.0 1.0
4   b   2018-02-01  1.0 1.0
5   b   2008-11-01  NaN 1.0
6   c   2014-11-01  6.0 7.0
7   c   2008-11-01  NaN 7.0
8   c   2022-07-01  7.0 7.0

I've tried the following, but grouped series ('SeriesGroupBy') cannot be cast to another dtype:

df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()

I also tried converting column b to int before running the code above, but then I get an error saying that none of the values in c are in the index:

None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]

I've also tried expanding the command into separate steps, but it raises errors (which I just realized relate to no alternative value being given to the where method, though I don't know how to supply the correct value):

df= df.groupby('a')

df['d'] = df.['c'].where(grouped['b'] == grouped['b'].max())
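
For reference, a where-based version of this idea can be sketched as follows. This is only a minimal sketch, not necessarily what the attempt above intended: it assumes the goal is the value of c on the row with the latest date in each group, it keeps df as a plain data frame instead of overwriting it with the groupby object, and the helper name is_latest is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "b": pd.to_datetime([
        "2008-11-01", "2022-07-01", "2017-02-01",
        "2017-02-01", "2018-02-01", "2008-11-01",
        "2014-11-01", "2008-11-01", "2022-07-01",
    ]),
    "c": [np.nan, 6, 8, 2, 1, np.nan, 6, np.nan, 7],
})

# transform("max") returns the group maximum aligned to the original rows,
# so the comparison is True only on rows holding their group's latest date.
is_latest = df["b"] == df.groupby("a")["b"].transform("max")

# where() keeps c on those rows and leaves NaN elsewhere; "first" then
# broadcasts the first non-null value of each group to all of its rows.
df["d"] = df["c"].where(is_latest).groupby(df["a"]).transform("first")
print(df)
```

With the sample data, the latest date in group a is 2022-07-01, so this reading fills group a with 6.0. If c is NaN on a group's latest row, the whole group gets NaN.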

I've also tried the solutions from several related questions, without success.

Any help or direction would be appreciated.

CodePudding user response:

I'm assuming the first three values of d should be 6.0, not 8.0, because the maximum date for group a is 2022-07-01 and the value of c on that row is 6:

df["d"] = df["a"].map(
    df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)

Prints:

   a          b    c    d
0  a 2008-11-01  NaN  6.0
1  a 2022-07-01  6.0  6.0
2  a 2017-02-01  8.0  6.0
3  b 2017-02-01  2.0  1.0
4  b 2018-02-01  1.0  1.0
5  b 2008-11-01  NaN  1.0
6  c 2014-11-01  6.0  7.0
7  c 2008-11-01  NaN  7.0
8  c 2022-07-01  7.0  7.0
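
On a data frame with many groups, the same result can probably be reached without calling apply with a Python lambda for every group. The following is a minimal sketch under the same assumption (d is the value of c on the row with the latest date in each group); the intermediate names latest_idx and latest_c are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "b": pd.to_datetime([
        "2008-11-01", "2022-07-01", "2017-02-01",
        "2017-02-01", "2018-02-01", "2008-11-01",
        "2014-11-01", "2008-11-01", "2022-07-01",
    ]),
    "c": [np.nan, 6, 8, 2, 1, np.nan, 6, np.nan, 7],
})

# Index label of the row with the latest date in each group of "a".
latest_idx = df.groupby("a")["b"].idxmax()

# Select those rows and key their "c" values by group name.
latest_c = df.loc[latest_idx].set_index("a")["c"]

# Map every row's group name to that group's selected "c" value.
df["d"] = df["a"].map(latest_c)
print(df)
```

If several rows in a group share the maximum date, idxmax picks the first of them, so ties are resolved by row order.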