Get value from grouped data frame maximum in another column

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.

I'm trying to return and assign (ideally to a new column) the value of column c taken from the row with the maximum value in column b (which holds dates). The data frame is grouped by column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
                   'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
                   'c': [np.nan,6,8,2,1,np.nan,6,np.nan,7]})
df['b'] = pd.to_datetime(df['b'])
df

    a   b   c
0   a   2008-11-01  NaN
1   a   2022-07-01  6.0
2   a   2017-02-01  8.0
3   b   2017-02-01  2.0
4   b   2018-02-01  1.0
5   b   2008-11-01  NaN
6   c   2014-11-01  6.0
7   c   2008-11-01  NaN
8   c   2022-07-01  7.0

I want the following result:

    a   b   c   d
0   a   2008-11-01  NaN 8.0
1   a   2022-07-01  6.0 8.0
2   a   2017-02-01  8.0 8.0
3   b   2017-02-01  2.0 1.0
4   b   2018-02-01  1.0 1.0
5   b   2008-11-01  NaN 1.0
6   c   2014-11-01  6.0 7.0
7   c   2008-11-01  NaN 7.0
8   c   2022-07-01  7.0 7.0

I've tried the following, but grouped series ('SeriesGroupBy') cannot be cast to another dtype:

df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()

I've also tried converting column b to int before running the above code, but then I get an error saying none of the values in c are in the index.

None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]

I've tried expanding out the command, but it also raises errors (which I just realized relate to no alternative value being given to the where function, but I don't know how to supply the correct value):

grouped = df.groupby('a')

df['d'] = df['c'].where(grouped['b'] == grouped['b'].max())

I've also tried using solutions provided here, here, and here.

Any help or direction would be appreciated.

CodePudding user response:

I'm assuming the first three values should be 6.0 (because the maximum date for group a is 2022-07-01, and the value of c in that row is 6):

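# For each group in "a", take "c" from the row with the latest "b",
# then map that per-group value back onto every row of the group.
df["d"] = df["a"].map(
    df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)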

Prints:

   a          b    c    d
0  a 2008-11-01  NaN  6.0
1  a 2022-07-01  6.0  6.0
2  a 2017-02-01  8.0  6.0
3  b 2017-02-01  2.0  1.0
4  b 2018-02-01  1.0  1.0
5  b 2008-11-01  NaN  1.0
6  c 2014-11-01  6.0  7.0
7  c 2008-11-01  NaN  7.0
8  c 2022-07-01  7.0  7.0
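
As a variation on the answer above, here is a minimal sketch that gets the same column without `groupby(...).apply`, which can be slow on larger frames and may emit a deprecation warning in recent pandas versions. It assumes the same sample data as the question. The names `latest_idx`, `c_at_latest`, `is_latest` and `d_alt` are made up for illustration. The second variant is one possible way to make the `where` idea from the question work, by using `transform` to broadcast the per-group maximum back to the original rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "b": ["2008-11-01", "2022-07-01", "2017-02-01",
          "2017-02-01", "2018-02-01", "2008-11-01",
          "2014-11-01", "2008-11-01", "2022-07-01"],
    "c": [np.nan, 6, 8, 2, 1, np.nan, 6, np.nan, 7],
})
df["b"] = pd.to_datetime(df["b"])

# Variant 1: idxmax + map.
# Index label of the row holding the latest date within each group.
latest_idx = df.groupby("a")["b"].idxmax()
# Value of "c" in those rows, keyed by the group label.
c_at_latest = df.loc[latest_idx].set_index("a")["c"]
# Broadcast the per-group value back onto every row.
df["d"] = df["a"].map(c_at_latest)

# Variant 2: boolean mask + where.
# transform("max") returns a Series aligned with df, so the comparison is row-wise.
is_latest = df["b"] == df.groupby("a")["b"].transform("max")
# Keep "c" only on the latest row of each group, then spread the first
# non-null value across the group (assumption: on tied dates, the first
# non-null "c" among the tied rows is acceptable).
d_alt = df["c"].where(is_latest).groupby(df["a"]).transform("first")

print(df)
```

Both variants stay within vectorized groupby operations, so they should scale comfortably to a frame with 16,000 rows. They differ when the row with the latest date has a missing `c`, or when several rows tie on that date. It is worth checking which behaviour you want in those cases.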