How to use a new column name created on the same line of code in Pandas

I have a DataFrame df, and I run this:

df.groupby(['StudentID', 'Major']).size().reset_index(name='Freq')

In the code above, I create a new column Freq that counts how often each combination of StudentID and Major occurs.

Now I want to keep only the rows where Freq is greater than 1, on the same line.

E.g.:

df.groupby(['StudentID', 'Major']).size().reset_index(name='Freq')[df['Freq'] > 1]

This does not work because the original df has no Freq column, so df['Freq'] raises a KeyError.

One possible way is to save the grouped result into a new DataFrame, let's say df2, and then filter with df2[df2['Freq'] > 1], but I want to know if there is a way to do it in one line of code.

CodePudding user response：

You can use pipe:

out = (df.groupby(['StudentID', 'Major'])
         .size()
         .reset_index(name='Freq')
         .pipe(lambda x: x[x['Freq']>1]))
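
For anyone who wants to run this, here is a minimal sketch on made-up sample data (the asker's actual df is not shown, so the values below are an assumption). It also shows two other spellings that should give the same result as pipe: .loc with a callable, and .query. Both are standard pandas methods, and the callable or query string is evaluated against the intermediate frame, so Freq is available.

```python
import pandas as pd

# Hypothetical sample data for illustration; not the asker's real df.
df = pd.DataFrame({
    'StudentID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Major': ['math', 'english', 'english', 'math', 'physics',
              'physics', 'physics', 'math', 'classics'],
})

# pipe: the lambda receives the frame produced by reset_index.
out_pipe = (df.groupby(['StudentID', 'Major'])
              .size()
              .reset_index(name='Freq')
              .pipe(lambda x: x[x['Freq'] > 1]))

# .loc with a callable: the callable is applied to the intermediate frame
# and must return a boolean mask.
out_loc = (df.groupby(['StudentID', 'Major'])
             .size()
             .reset_index(name='Freq')
             .loc[lambda d: d['Freq'] > 1])

# .query: the column is referenced by name inside the expression string.
out_query = (df.groupby(['StudentID', 'Major'])
               .size()
               .reset_index(name='Freq')
               .query('Freq > 1'))

print(out_pipe)
```

On this sample data all three keep only the (1, english) and (2, physics) rows. Which one to use is mostly a matter of taste; .query reads shortest, while .pipe and .loc accept arbitrary Python conditions.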

CodePudding user response：

You can do it this way using the walrus operator := (requires Python 3.8 or later).

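import pandas as pd

# Sample data: one row per (StudentID, Major) enrollment
df = pd.DataFrame({
    'StudentID' : [1,1,1,2,2,2,2,3,3],
    'Major' : 'math,english,english,math,physics,physics,physics,math,classics'.split(',')
})
print(df)
# The walrus assignment rebinds df to the grouped frame before the mask
# df['Freq'] > 1 is evaluated, so the Freq column exists by then.
x = (df := df.groupby(['StudentID', 'Major']).size().reset_index(name='Freq'))[df['Freq'] > 1]
print(x)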

Output:

   StudentID     Major
0          1      math
1          1   english
2          1   english
3          2      math
4          2   physics
5          2   physics
6          2   physics
7          3      math
8          3  classics
   StudentID    Major  Freq
0          1  english     2
3          2  physics     3
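
One caveat with the walrus version: it overwrites df with the grouped frame. If the original df is still needed afterwards, a possible variation is to bind the intermediate result to a different name. The name counts below is just an illustrative choice, and the sample data is the same made-up data as in this answer.

```python
import pandas as pd

# Same illustrative sample data as above.
df = pd.DataFrame({
    'StudentID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Major': 'math,english,english,math,physics,physics,physics,math,classics'.split(','),
})

# Bind the grouped frame to a new name (counts, a hypothetical name) so that
# df itself is left untouched; counts is assigned before the mask is built.
x = (counts := df.groupby(['StudentID', 'Major']).size().reset_index(name='Freq'))[counts['Freq'] > 1]

print(list(df.columns))  # df still has only its original columns
print(x)
```

Here counts remains available as the unfiltered frequency table, which can be handy if more than one filter is applied to it later.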