I have an unbalanced data panel where the observations for each ID are already in ascending order by ID student period.
Basically, I want to create a new column where for each row of a specific ID will be recorded the first type of scholarship that the student receives in the period that he enters the data panel:
E.g.: If I have observations for ID xxxx in the 4, 5, 6 and 7 period, I want to record in this new column, for all the periods of that ID, the type of scholarship he received in the 4 period.
In R, I'm able to do this using dplyr::first
:
df = df %>%
dplyr::mutate(scholarship_first = dplyr::first(scholarship))
Input:
ID student_period scholarship
4567 1 scholarship_level_1
4567 2 scholarship_level_2
4567 3 scholarship_level_2
4567 4 scholarship_level_3
5478 4 scholarship_level_3
5478 5 scholarship_level_3
6758 7 scholarship_level_1
6758 8 scholarship_level_2
6758 9 scholarship_level_2
Output:
ID student_period scholarship scholarship_first
4567 1 scholarship_level_1 scholarship_level_1
4567 2 scholarship_level_2 scholarship_level_1
4567 3 scholarship_level_2 scholarship_level_1
4567 4 scholarship_level_3 scholarship_level_1
5478 4 scholarship_level_3 scholarship_level_3
5478 5 scholarship_level_3 scholarship_level_3
6758 7 scholarship_level_1 scholarship_level_1
6758 8 scholarship_level_2 scholarship_level_1
6758 9 scholarship_level_2 scholarship_level_1
As I am just starting to study Python now, I don't know yet how I could do this using this language. Can someone help me?
CodePudding user response:
IIUC
df["scholarship_first"] = df.groupby(level = 0)["scholarship"].first()
student_period scholarship scholarship_first
ID
4567 1 scholarship_level_1 scholarship_level_1
4567 2 scholarship_level_2 scholarship_level_1
4567 3 scholarship_level_2 scholarship_level_1
4567 4 scholarship_level_3 scholarship_level_1
5478 4 scholarship_level_3 scholarship_level_3
5478 5 scholarship_level_3 scholarship_level_3
6758 7 scholarship_level_1 scholarship_level_1
6758 8 scholarship_level_2 scholarship_level_1
6758 9 scholarship_level_2 scholarship_level_1
pandas.DataFrame.groupby
let you group by ID
if you specify level = 0
. Then you can get the first occurrence of the group with pandas.core.groupby.GroupBy.first
.
CodePudding user response:
With datar
, a re-imagination of pandas APIs, you can do it as easy as you did in R:
>>> from datar.all import f, tribble, first, mutate, group_by
>>> df = tribble(
... f.ID, f.student_period, f.scholarship,
... 4567, 1, "scholarship_level_1",
... 4567, 2, "scholarship_level_2",
... 4567, 3, "scholarship_level_2",
... 4567, 4, "scholarship_level_3",
... 5478, 4, "scholarship_level_3",
... 5478, 5, "scholarship_level_3",
... 6758, 7, "scholarship_level_1",
... 6758, 8, "scholarship_level_2",
... 6758, 9, "scholarship_level_2",
... )
>>>
>>> df >> group_by(f.ID) >> mutate(scholarship_first=first(f.scholarship))
ID student_period scholarship scholarship_first
<int64> <int64> <object> <object>
0 4567 1 scholarship_level_1 scholarship_level_1
1 4567 2 scholarship_level_2 scholarship_level_1
2 4567 3 scholarship_level_2 scholarship_level_1
3 4567 4 scholarship_level_3 scholarship_level_1
4 5478 4 scholarship_level_3 scholarship_level_3
5 5478 5 scholarship_level_3 scholarship_level_3
6 6758 7 scholarship_level_1 scholarship_level_1
7 6758 8 scholarship_level_2 scholarship_level_1
8 6758 9 scholarship_level_2 scholarship_level_1
[TibbleGrouped: ID (n=3)]