I have the following data frame:
import pandas as pd

df = {
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'name_ID': [1, 1, 2, 2, 2, 3, 3],
    'score': [400, 500, 3000, 1000, 4000, 600, 750],
    'score_number': [1, 2, 1, 2, 3, 1, 2]
}
df = pd.DataFrame(df)
Note that df is grouped by name / name_ID. A name can have n scores, e.g. A has 2 scores, whereas B has 3 scores. I want an additional column, reference_score, that holds the first score per name / name_ID. The reference_score for the first score of each name should be NaN. For example, the two rows for A should get NaN and 400.
I have tried:
df_v2['first_fund'] = df_v2['fund_size'].groupby(df_v2['firm_ID']).first()
and also .nth, but it didn't work.
Thanks in advance.
CodePudding user response:
Let's use groupby.transform to broadcast each group's first row value to the whole group, then mask the first row of each group as NaN with the condition ~df.duplicated('name', keep='first').
# sort the dataframe first if score number is not ascending
# df = df.sort_values(['name_ID', 'score_number'])
df['reference_score'] = (df.groupby('name')['score']
                           .transform('first')
                           .mask(~df.duplicated('name', keep='first')))
print(df)
  name  name_ID  score  score_number  reference_score
0    A        1    400             1              NaN
1    A        1    500             2            400.0
2    B        2   3000             1              NaN
3    B        2   1000             2           3000.0
4    B        2   4000             3           3000.0
5    C        3    600             1              NaN
6    C        3    750             2            600.0
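To make the two steps easier to follow, here is a self-contained sketch that splits the one-liner into its parts. The intermediate names first_score and is_first_row are only illustrative and do not appear in the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "C", "C"],
    "name_ID": [1, 1, 2, 2, 2, 3, 3],
    "score": [400, 500, 3000, 1000, 4000, 600, 750],
    "score_number": [1, 2, 1, 2, 3, 1, 2],
})

# Step 1: broadcast each group's first score to every row of that group.
first_score = df.groupby("name")["score"].transform("first")

# Step 2: flag the first row of each name (True where the name has not appeared before).
is_first_row = ~df.duplicated("name", keep="first")

# Step 3: replace the flagged rows with NaN; the column becomes float as a result.
df["reference_score"] = first_score.mask(is_first_row)
print(df)
```

Unlike groupby(...).first(), which returns one row per group, transform keeps the original row index, so the result can be assigned back as a column directly.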
Or we can compare score_number with 1 to identify the first row in each group.
df['reference_score'] = (df.groupby('name')['score']
                           .transform('first')
                           .mask(df['score_number'].eq(1)))
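Both variants rely on the rows being ordered by score_number within each name, because transform('first') takes the first row in frame order. A rough sketch with a shuffled copy of the data (the shuffled order is made up here for illustration) shows the sort step doing its job:

```python
import pandas as pd

# Same rows as in the question, but in an arbitrary order (illustrative only).
df = pd.DataFrame({
    "name": ["B", "A", "B", "C", "A", "B", "C"],
    "name_ID": [2, 1, 2, 3, 1, 2, 3],
    "score": [1000, 500, 3000, 750, 400, 4000, 600],
    "score_number": [2, 2, 1, 2, 1, 3, 1],
})

# Sort so that score_number 1 is the first row of each group.
df = df.sort_values(["name_ID", "score_number"]).reset_index(drop=True)

# First score per name, blanked out on the row that is itself the first score.
df["reference_score"] = (
    df.groupby("name")["score"]
      .transform("first")
      .mask(df["score_number"].eq(1))
)
print(df)
```

Without the sort, the first row seen for B would be the one with score 1000, so that value would be used as the reference instead of 3000.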
CodePudding user response:
Another solution:
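# group_keys=False keeps the original row index on the result, which newer
# pandas versions need for the column assignment to align
df["reference_score"] = df.groupby("name", group_keys=False)["score"].apply(
    lambda x: pd.Series([x.iat[0]] * len(x), index=x.index).shift()
)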
print(df)
Prints:
  name  name_ID  score  score_number  reference_score
0    A        1    400             1              NaN
1    A        1    500             2            400.0
2    B        2   3000             1              NaN
3    B        2   1000             2           3000.0
4    B        2   4000             3           3000.0
5    C        3    600             1              NaN
6    C        3    750             2            600.0
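As a sanity check, the apply-based approach can be compared against a vectorized variant. This is only a sketch: group_keys=False is added here on the assumption that a recent pandas version is in use, and the cumcount-based condition is an alternative that is not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "C", "C"],
    "name_ID": [1, 1, 2, 2, 2, 3, 3],
    "score": [400, 500, 3000, 1000, 4000, 600, 750],
    "score_number": [1, 2, 1, 2, 3, 1, 2],
})

# apply-based variant: repeat the first score, then shift by one row so the
# first row of each group becomes NaN.
via_apply = df.groupby("name", group_keys=False)["score"].apply(
    lambda x: pd.Series([x.iat[0]] * len(x), index=x.index).shift()
)

# Vectorized alternative: keep the first score only where the row is not the
# first one in its group (cumcount numbers rows within each group from 0).
via_transform = df.groupby("name")["score"].transform("first").where(
    df.groupby("name").cumcount() > 0
)

# Both should agree row by row (NaN replaced by -1 for the comparison).
print(via_apply.fillna(-1).tolist() == via_transform.fillna(-1).tolist())
```

On larger frames the transform-based form is usually preferable, since apply calls a Python function once per group.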