New column that counts the frequency that a value occurs in a shifted Pandas dataframe column

Time:09-28

I have a dataframe that looks like

ID   Date       feature
1    2020-05-01 2 
1    2020-05-01 3 
1    2020-05-01 4
2    2019-03-15 3
2    2019-03-15 2
3    2022-04-22 5
3    2022-04-22 8
3    2022-04-22 4
3    2022-04-22 2
4    2015-01-18 4
4    2015-01-18 6
4    2015-01-18 7

I sort it by time in descending order using df.sort_values('Date', ascending=False, inplace=True)

ID   Date       feature
3    2022-04-22 5
3    2022-04-22 8
3    2022-04-22 4
3    2022-04-22 2
1    2020-05-01 2 
1    2020-05-01 3 
1    2020-05-01 4
2    2019-03-15 3
2    2019-03-15 2
4    2015-01-18 4
4    2015-01-18 6
4    2015-01-18 7

and I want to add a new column n_ID@t-1 that holds the number of rows belonging to the previous ID in time (the ID that comes next in the sorted frame), so the desired output should look like

ID   Date       feature n_ID@t-1
3    2022-04-22 5       3   #since the last ID 1 has 3 rows
3    2022-04-22 8       3
3    2022-04-22 4       3
3    2022-04-22 2       3
1    2020-05-01 2       2   #since the last ID 2 has 2 rows
1    2020-05-01 3       2 
1    2020-05-01 4       2
2    2019-03-15 3       3   #since the last ID 4 has 3 rows
2    2019-03-15 2       3
4    2015-01-18 4       0   #since there is no last ID
4    2015-01-18 6       0
4    2015-01-18 7       0  

I know about .value_counts() and transform, but I don't know how to use them to build the new column.

CodePudding user response:

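On the sorted frame, you can use groupby.size (or value_counts) with sort=False to get the size of each group in order of appearance, then shift that Series by one position and map it back onto the ID column so that each row receives the size of the group that follows it: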

s = df.groupby('ID', sort=False)['ID'].size()
# or
# s = df['ID'].value_counts(sort=False)

df['n_ID@t-1'] = df['ID'].map(s.shift(-1, fill_value=0))

output:

    ID        Date  feature  n_ID@t-1
5    3  2022-04-22        5         3
6    3  2022-04-22        8         3
7    3  2022-04-22        4         3
8    3  2022-04-22        2         3
0    1  2020-05-01        2         2
1    1  2020-05-01        3         2
2    1  2020-05-01        4         2
3    2  2019-03-15        3         3
4    2  2019-03-15        2         3
9    4  2015-01-18        4         0
10   4  2015-01-18        6         0
11   4  2015-01-18        7         0

intermediate s:

ID
3    4
1    3
2    2
4    3
Name: ID, dtype: int64
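
For reference, the whole approach can be run end to end as a minimal sketch. The frame is rebuilt from the question's sample data. The names sizes and prev_sizes and the kind="mergesort" argument are additions for illustration and are not part of the original answer. The stable sort only keeps rows within a date in their original order and does not affect the counts.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
    "Date": pd.to_datetime(
        ["2020-05-01"] * 3 + ["2019-03-15"] * 2
        + ["2022-04-22"] * 4 + ["2015-01-18"] * 3
    ),
    "feature": [2, 3, 4, 3, 2, 5, 8, 4, 2, 4, 6, 7],
})

# Newest date first; mergesort is stable, so ties keep their original order
df = df.sort_values("Date", ascending=False, kind="mergesort")

# Group sizes in order of first appearance in the sorted frame: 3, 1, 2, 4
sizes = df.groupby("ID", sort=False).size()

# Shift up by one so each ID is paired with the size of the group after it;
# the last group has no successor and gets 0
prev_sizes = sizes.shift(-1, fill_value=0)

# Broadcast the per-ID value back onto every row
df["n_ID@t-1"] = df["ID"].map(prev_sizes)

print(df)
```

Note that the grouping has to be done after sorting: with sort=False the group order follows the row order of the frame, so running it on the unsorted frame would pair each ID with the wrong neighbour.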