Home > Net >  Creating a column with the first occurrence of a given variable for each ID in a data panel (python)
Creating a column with the first occurrence of a given variable for each ID in a data panel (python)

Time:03-29

I have an unbalanced data panel where the observations for each ID are already in ascending order by ID student period.

Basically, I want to create a new column where for each row of a specific ID will be recorded the first type of scholarship that the student receives in the period that he enters the data panel:

E.g.: If I have observations for ID xxxx in the 4, 5, 6 and 7 period, I want to record in this new column, for all the periods of that ID, the type of scholarship he received in the 4 period.

In R, I'm able to do this using dplyr::first:

df = df %>%
  dplyr::mutate(scholarship_first = dplyr::first(scholarship))

Input:

ID     student_period         scholarship
   
4567        1              scholarship_level_1
4567        2              scholarship_level_2
4567        3              scholarship_level_2
4567        4              scholarship_level_3
5478        4              scholarship_level_3
5478        5              scholarship_level_3
6758        7              scholarship_level_1
6758        8              scholarship_level_2
6758        9              scholarship_level_2

Output:


ID     student_period         scholarship            scholarship_first
   
4567        1              scholarship_level_1      scholarship_level_1
4567        2              scholarship_level_2      scholarship_level_1 
4567        3              scholarship_level_2      scholarship_level_1
4567        4              scholarship_level_3      scholarship_level_1
5478        4              scholarship_level_3      scholarship_level_3
5478        5              scholarship_level_3      scholarship_level_3
6758        7              scholarship_level_1      scholarship_level_1
6758        8              scholarship_level_2      scholarship_level_1
6758        9              scholarship_level_2      scholarship_level_1

As I am just starting to study Python now, I don't know yet how I could do this using this language. Can someone help me?

CodePudding user response:

IIUC

df["scholarship_first"] = df.groupby(level = 0)["scholarship"].first()

      student_period          scholarship    scholarship_first
ID                                                            
4567               1  scholarship_level_1  scholarship_level_1
4567               2  scholarship_level_2  scholarship_level_1
4567               3  scholarship_level_2  scholarship_level_1
4567               4  scholarship_level_3  scholarship_level_1
5478               4  scholarship_level_3  scholarship_level_3
5478               5  scholarship_level_3  scholarship_level_3
6758               7  scholarship_level_1  scholarship_level_1
6758               8  scholarship_level_2  scholarship_level_1
6758               9  scholarship_level_2  scholarship_level_1

pandas.DataFrame.groupby let you group by ID if you specify level = 0. Then you can get the first occurrence of the group with pandas.core.groupby.GroupBy.first.

CodePudding user response:

With datar, a re-imagination of pandas APIs, you can do it as easy as you did in R:

>>> from datar.all import f, tribble, first, mutate, group_by
>>> df = tribble(
...     f.ID,  f.student_period, f.scholarship,
...     4567,  1,                "scholarship_level_1",
...     4567,  2,                "scholarship_level_2",
...     4567,  3,                "scholarship_level_2",
...     4567,  4,                "scholarship_level_3",
...     5478,  4,                "scholarship_level_3",
...     5478,  5,                "scholarship_level_3",
...     6758,  7,                "scholarship_level_1",
...     6758,  8,                "scholarship_level_2",
...     6758,  9,                "scholarship_level_2",
... )
>>> 
>>> df >> group_by(f.ID) >> mutate(scholarship_first=first(f.scholarship))
       ID  student_period          scholarship    scholarship_first
  <int64>         <int64>             <object>             <object>
0    4567               1  scholarship_level_1  scholarship_level_1
1    4567               2  scholarship_level_2  scholarship_level_1
2    4567               3  scholarship_level_2  scholarship_level_1
3    4567               4  scholarship_level_3  scholarship_level_1
4    5478               4  scholarship_level_3  scholarship_level_3
5    5478               5  scholarship_level_3  scholarship_level_3
6    6758               7  scholarship_level_1  scholarship_level_1
7    6758               8  scholarship_level_2  scholarship_level_1
8    6758               9  scholarship_level_2  scholarship_level_1
[TibbleGrouped: ID (n=3)]
  • Related