What does set(x) mean in a pandas DataFrame groupby aggregation function?

Time:12-27

I'm new to pandas DataFrames.

I can't understand the code inside agg. I get that this is a group-by and aggregate, like with a PySpark DataFrame. But what does for i in set(x) mean?

Where does set(x) come from, and what is pd.isnull(i)? Is it just a null checker?

newDF = existing.groupby(["cols"]).agg(
    new_col1=pd.NamedAgg(column='ColA', aggfunc=lambda x: [i.split(':')[0] for i in set(x) if not pd.isnull(i)]),
    ColA=pd.NamedAgg(column='ColA', aggfunc=lambda x: [i for i in set(x) if not pd.isnull(i)]),
    ColC=pd.NamedAgg(column='ColB', aggfunc=lambda x: ', '.join([i if not pd.isnull(i) else 'Good' for i in set(x)])),
    ColE=pd.NamedAgg(column='ColD', aggfunc=lambda x: ', '.join([i if not pd.isnull(i) else 'Good' for i in set(x)]), ))

Please help me with this

CodePudding user response:

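In Python, sets are mutable, unordered collections of unique elements. Inside agg, x is the Series of values belonging to one group, so set(x) removes the duplicates, and pd.isnull(i) checks whether an element is missing (NaN or None). Example: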

import pandas as pd
import numpy as np
lst = [1, 2, 2, 3, np.nan, 3, 3]  # np.nan: the np.NaN alias was removed in NumPy 2.0
a = pd.Series(data=lst)
b = set(a)
c = [i if not pd.isnull(i) else 'Good' for i in set(a)]

print(a)

0    1.0
1    2.0
2    2.0
3    3.0
4    NaN
5    3.0
6    3.0
dtype: float64

print(b)

{nan, 1.0, 2.0, 3.0}

print(c)

['Good', 1.0, 2.0, 3.0]
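
To see where x comes from in the question's code, here is a minimal sketch that replaces the lambda with a named function. The data and the function name prefixes are made up for illustration; only the column names come from the question. sorted() is added so the result does not depend on the order in which a set is iterated.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
existing = pd.DataFrame({
    "cols": ["a", "a", "a", "b", "b"],
    "ColA": ["k1:v1", "k1:v1", np.nan, "k2:v2", "k3:v3"],
})

def prefixes(x):
    # x is a pandas Series holding the ColA values of one group
    unique_values = set(x)  # duplicates collapse; iteration order is not guaranteed
    kept = [i for i in unique_values if not pd.isnull(i)]  # drop NaN / None
    # keep the text before the first ':'; sorted only to make the output reproducible
    return sorted(i.split(":")[0] for i in kept)

newDF = existing.groupby("cols").agg(
    new_col1=pd.NamedAgg(column="ColA", aggfunc=prefixes),
)
print(newDF)
```

Group "a" holds two copies of "k1:v1" plus a NaN, so it ends up as ['k1']; group "b" ends up as ['k2', 'k3'].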

CodePudding user response:

A simple example illustrates the explanations given in the comments. With this DataFrame:

   cols  ColA  ColB
0     1   1.0    11
1     2   3.0    12
2     2   3.0    23
3     3   2.0    24
4     4   5.0    25
5     3   6.0    26
6     1   5.0    27
7     2   NaN    28

and using

newDF = df.groupby(["cols"]).agg(ColA=pd.NamedAgg(column='ColA', aggfunc = lambda x: [i for i in set(x) if not pd.isnull(i)]))

you would get:

            ColA
cols            
1     [1.0, 5.0]
2          [3.0]
3     [2.0, 6.0]
4          [5.0]

This shows set(x) removing the duplicate 3.0 in group 2 and pd.isnull() filtering out that group's NaN, as already explained in the comments.
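
For anyone who wants to run this, the sketch below rebuilds the DataFrame from the table as printed and applies the same aggregation. The ColA_ordered column is not part of the answer; it is an assumed alternative using Series.dropna() and Series.unique(), which keeps values in order of first appearance, whereas the iteration order of a set is not guaranteed.

```python
import numpy as np
import pandas as pd

# DataFrame rebuilt from the table shown in this answer
df = pd.DataFrame({
    "cols": [1, 2, 2, 3, 4, 3, 1, 2],
    "ColA": [1.0, 3.0, 3.0, 2.0, 5.0, 6.0, 5.0, np.nan],
    "ColB": [11, 12, 23, 24, 25, 26, 27, 28],
})

newDF = df.groupby(["cols"]).agg(
    # same logic as the answer: unique, non-null values per group
    ColA=pd.NamedAgg(
        column="ColA",
        aggfunc=lambda x: [i for i in set(x) if not pd.isnull(i)],
    ),
    # assumed alternative (not in the answer): pandas-native, order-preserving
    ColA_ordered=pd.NamedAgg(
        column="ColA",
        aggfunc=lambda x: x.dropna().unique().tolist(),
    ),
)
print(newDF)
```

Both columns contain the same values per group; only the ordering guarantee differs.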
