Creating subsets of df using pandas groupby and getting a value based on a function

I have a DataFrame similar to the one below. Within each group defined by df['ID'], I need to select the row where df['Year 2'] is equal to, or closest to, df['Year']. In this example that means rows 1, 2 and 5.

df
    Year     ID   A   Year 2      C
0    2020    12   0    2019       0
1    2020    12   0    2020       0 <-
2    2017    10   1    2017       0 <-
3    2017    10   0    2018       0
4    2019    6    0    2017       0
5    2019    6    1    2018       0 <-

I am trying to do this with the following line, using groupby and passing a function that picks the row where the two columns are closest.

df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))

This particular line raises "'int' object is not callable". Any ideas on how to fix this line, or a fresh approach to the problem, would be appreciated.

TYIA.

CodePudding user response：

You can subtract one column from the other with Series.sub, take the absolute values with Series.abs, and then get the index label of the minimum value in each group with SeriesGroupBy.idxmin:

idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
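
To see what each step produces, here is a minimal, self-contained sketch that rebuilds the sample frame from the question. The variable name diff is only used for illustration and is not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Year':   [2020, 2020, 2017, 2017, 2019, 2019],
    'ID':     [12, 12, 10, 10, 6, 6],
    'A':      [0, 0, 1, 0, 0, 1],
    'Year 2': [2019, 2020, 2017, 2018, 2017, 2018],
    'C':      [0, 0, 0, 0, 0, 0],
})

# Absolute distance between the two year columns, row by row
diff = df['Year 2'].sub(df['Year']).abs()
print(diff.tolist())   # [1, 0, 0, 1, 2, 1]

# Index label of the smallest distance within each ID group
idx = diff.groupby(df['ID']).idxmin()
print(idx.tolist())    # [5, 2, 1]  (groups come back sorted by ID: 6, 10, 12)
```

The result idx is a Series indexed by ID whose values are row labels of df, which is why it can be passed to Index.isin or DataFrame.loc in the next steps.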

If you need a new column filled with booleans, use Index.isin:

df['new'] = df.index.isin(idx)
print(df)
   Year  ID  A  Year 2  C    new
0  2020  12  0    2019  0  False
1  2020  12  0    2020  0   True
2  2017  10  1    2017  0   True
3  2017  10  0    2018  0  False
4  2019   6  0    2017  0  False
5  2019   6  1    2018  0   True

If you need to filter the rows, use DataFrame.loc:

df1 = df.loc[idx]
print(df1)
   Year  ID  A  Year 2  C
5  2019   6  1    2018  0
2  2017  10  1    2017  0
1  2020  12  0    2020  0

One-line solution:

df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
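
As for the error in the question: groupby(...).apply expects a function, but min(df['Year 2'], key=...) is evaluated first and returns a single number, which pandas then tries to call. If you prefer to stay with apply, a minimal sketch could look like the following. It assumes the sample data from the question, and the helper name closest_label is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Year':   [2020, 2020, 2017, 2017, 2019, 2019],
    'ID':     [12, 12, 10, 10, 6, 6],
    'A':      [0, 0, 1, 0, 0, 1],
    'Year 2': [2019, 2020, 2017, 2018, 2017, 2018],
    'C':      [0, 0, 0, 0, 0, 0],
})

def closest_label(distances):
    """Return the index label of the smallest distance in one group."""
    return distances.idxmin()

# Absolute distance between the two year columns
diff = (df['Year 2'] - df['Year']).abs()

# Pass the function itself (no parentheses) so pandas can call it once per group
idx = diff.groupby(df['ID']).apply(closest_label)

df1 = df.loc[idx]
print(df1.index.tolist())  # [5, 2, 1]
```

This gives the same rows as the idxmin version above. Calling idxmin directly on the groupby is shorter and avoids a Python-level function call per group.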

CodePudding user response：

You could get the idxmin per group:

idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()

# mark the selected rows in a new column, to check the result
df.loc[idx, 'D'] = '<-'

For selection only:

df2 = df.loc[idx]

Output (df after the assignment):

   Year  ID  A  Year 2  C    D
0  2020  12  0    2019  0  NaN
1  2020  12  0    2020  0   <-
2  2017  10  1    2017  0   <-
3  2017  10  0    2018  0  NaN
4  2019   6  0    2017  0  NaN
5  2019   6  1    2018  0   <-

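Note that there is a difference between:

df.loc[df.index.isin(idx)]

which keeps the selected rows in their original order (1, 2, 5 here)

and:

df.loc[idx]

which returns them in the order of idx, that is, sorted by ID (5, 2, 1 here). In both cases, with a unique index, only one row per group is selected: when several rows tie for the minimum, idxmin returns the first match.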
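
If the goal is to keep every row that ties for the smallest distance in its group, one option (not taken from the answers above) is to compare each distance with its group minimum using transform('min'). A sketch, using made-up data that contains a tie; the names diff, group_min, all_closest and first_closest are illustrative only:

```python
import pandas as pd

# Hypothetical data: ID 6 has two rows that are equally close to 'Year'
df = pd.DataFrame({
    'Year':   [2019, 2019, 2017, 2017],
    'ID':     [6, 6, 10, 10],
    'Year 2': [2018, 2020, 2017, 2018],
})

# Absolute distance between the two year columns
diff = (df['Year 2'] - df['Year']).abs()

# Broadcast each group's minimum distance back to every row of that group
group_min = diff.groupby(df['ID']).transform('min')

# The boolean mask keeps all rows that reach their group's minimum, ties included
all_closest = df[diff == group_min]
print(all_closest.index.tolist())    # [0, 1, 2]

# idxmin, by contrast, returns only the first such row per group
first_closest = df.loc[diff.groupby(df['ID']).idxmin()]
print(first_closest.index.tolist())  # [0, 2]
```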