How do I select the first item in a column after grouping for another column in pandas?

Time:10-24

I have the following data frame:

import pandas as pd

df = {
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'name_ID': [1, 1, 2, 2, 2, 3, 3],
    'score': [400, 500, 3000, 1000, 4000, 600, 750],
    'score_number': [1, 2, 1, 2, 3, 1, 2]
    }
df = pd.DataFrame(df)

Note that the rows are grouped by name / name_ID. Each name can have any number of scores, e.g. A has 2 scores, whereas B has 3. I want an additional column, reference_score, that holds the first score per name / name_ID. For the row that is itself the first score of a name, reference_score should be NaN.

I have tried df_v2['first_fund'] = df_v2['fund_size'].groupby(df_v2['firm_ID']).first(), and also .nth, but neither worked.

Thanks in advance.

CodePudding user response:

Use groupby.transform('first') to broadcast each group's first score to every row of that group, then mask the first row of each group as NaN with the condition ~df.duplicated('name', keep='first').

# sort the dataframe first if score number is not ascending
# df = df.sort_values(['name_ID', 'score_number'])
df['reference_score'] = (df.groupby('name')['score']
                         .transform('first')
                         .mask(~df.duplicated('name', keep='first')))
print(df)

  name  name_ID  score  score_number  reference_score
0    A        1    400             1              NaN
1    A        1    500             2            400.0
2    B        2   3000             1              NaN
3    B        2   1000             2           3000.0
4    B        2   4000             3           3000.0
5    C        3    600             1              NaN
6    C        3    750             2            600.0

Or we can compare score_number with 1 to define the first row in each group.

df['reference_score'] = (df.groupby('name')['score']
                         .transform('first')
                         .mask(df['score_number'].eq(1)))
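
As a side note on why the attempt in the question likely fails: groupby(...).first() aggregates to one value per group, indexed by the group key, whereas transform('first') returns one value per original row. The following is a minimal sketch of that difference, using a cut-down copy of the example frame (the variable names per_group and per_row are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'score': [400, 500, 3000, 1000, 4000, 600, 750],
})

# first() aggregates: one value per group, indexed by the group key
per_group = df.groupby('name')['score'].first()
print(per_group.index.tolist())   # ['A', 'B', 'C']

# transform('first') broadcasts: one value per original row, same index as df
per_row = df.groupby('name')['score'].transform('first')
print(per_row.tolist())           # [400, 400, 3000, 3000, 3000, 600, 600]
```

Because per_group is indexed by name rather than by row position, assigning it directly to a column of df does not line up with the rows, while per_row does.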

CodePudding user response:

Another solution:

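# group_keys=False keeps the original row index on the result,
# so it aligns with df on assignment in recent pandas versions
df["reference_score"] = df.groupby("name", group_keys=False)["score"].apply(
    lambda x: pd.Series([x.iat[0]] * len(x), index=x.index).shift()
)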
print(df)

Prints:

  name  name_ID  score  score_number  reference_score
0    A        1    400             1              NaN
1    A        1    500             2            400.0
2    B        2   3000             1              NaN
3    B        2   1000             2           3000.0
4    B        2   4000             3           3000.0
5    C        3    600             1              NaN
6    C        3    750             2            600.0
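
Both approaches above take the first row of each group as it appears in the frame. If the rows might not already be ordered by score_number (an assumption here, since the example data is ordered), sorting first should give the same result. A rough sketch with the question's data deliberately shuffled:

```python
import pandas as pd

# Same rows as the question, shuffled to simulate unordered input
df = pd.DataFrame({
    'name': ['B', 'A', 'C', 'B', 'A', 'B', 'C'],
    'name_ID': [2, 1, 3, 2, 1, 2, 3],
    'score': [1000, 500, 750, 3000, 400, 4000, 600],
    'score_number': [2, 2, 2, 1, 1, 3, 1],
})

# Order rows so the first row of each group is the one with score_number == 1
df = df.sort_values(['name_ID', 'score_number']).reset_index(drop=True)

# Broadcast each group's first score to all rows of the group
first = df.groupby('name')['score'].transform('first')

# Blank out the reference on the row that is itself the first score
df['reference_score'] = first.mask(df['score_number'].eq(1))
print(df)
```

The reset_index(drop=True) call is optional; it only renumbers the rows after sorting.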