I'm new to pandas DataFrames.
I can't understand the code inside agg. I get that this is a group-by followed by an aggregation, like with a PySpark DataFrame. But what does for i in set(x) mean?
Where does the set(x) come from, and what is pd.isnull(i)? Is it just a null checker?
newDF = existing.groupby(["cols"]).agg(
    new_col1=pd.NamedAgg(column='ColA', aggfunc=lambda x: [i.split(':')[0] for i in set(x) if not pd.isnull(i)]),
    ColA=pd.NamedAgg(column='ColA', aggfunc=lambda x: [i for i in set(x) if not pd.isnull(i)]),
    ColC=pd.NamedAgg(column='ColB', aggfunc=lambda x: ', '.join([i if not pd.isnull(i) else 'Good' for i in set(x)])),
    ColE=pd.NamedAgg(column='ColD', aggfunc=lambda x: ', '.join([i if not pd.isnull(i) else 'Good' for i in set(x)])),
)
Please help me with this
CodePudding user response:
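In Python, a set is a mutable, unordered collection of unique elements, so set(x) drops the duplicate values from x. pd.isnull(i) returns True when i is a missing value such as NaN or None. Example: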
import pandas as pd
import numpy as np
lst = [1, 2, 2, 3, np.nan, 3, 3]  # np.nan rather than np.NaN, which was removed in NumPy 2.0
a = pd.Series(data=lst)
b = set(a)
c = [i if not pd.isnull(i) else 'Good' for i in set(a)]
print(a)
0 1.0
1 2.0
2 2.0
3 3.0
4 NaN
5 3.0
6 3.0
print(b)
{nan, 1.0, 2.0, 3.0}
print(c)
['Good', 1.0, 2.0, 3.0]
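The same idea applies to the string columns in the question. Inside agg, x is the pandas Series holding one group's values for the chosen column. The following is only a minimal sketch: the values in x are made up for illustration, and sorted() is added here so the printed order is stable, since a set has no guaranteed order.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one group's 'ColA' column.
x = pd.Series(["a:1", "b:2", "a:1", np.nan])

# Duplicates collapse in the set; the single NaN is still a member.
unique_values = set(x)

# Same logic as new_col1 in the question: skip NaN, keep the text before ':'.
prefixes = sorted(i.split(":")[0] for i in unique_values if not pd.isnull(i))

# Same logic as ColC/ColE in the question: replace NaN with 'Good', then join.
joined = ", ".join(sorted(i if not pd.isnull(i) else "Good" for i in unique_values))

print(prefixes)
print(joined)
```

Note that ', '.join(...) only works when every remaining element is a string, so this pattern assumes a text column.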
CodePudding user response:
A simple example illustrates the explanations given in the comments. With this DataFrame:
   cols  ColA  ColB
0     1   1.0    11
1     2   3.0    12
2     2   3.0    23
3     3   2.0    24
4     4   5.0    25
5     3   6.0    26
6     1   5.0    27
7     2   NaN    28
and using
newDF = df.groupby(["cols"]).agg(ColA=pd.NamedAgg(column='ColA', aggfunc = lambda x: [i for i in set(x) if not pd.isnull(i)]))
you would get:
            ColA
cols
1     [1.0, 5.0]
2          [3.0]
3     [2.0, 6.0]
4          [5.0]
which shows set(x) collapsing the duplicate 3.0 in group 2 and pd.isnull() dropping the NaN, as already explained in the comments.
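To try this yourself, the example can be reproduced roughly as follows. This is a sketch under a couple of assumptions: the DataFrame is rebuilt by hand from the table above, the lambda is replaced by a named helper (unique_non_null, a name chosen here for readability), and sorted() is used so the order inside each list does not depend on set iteration order.

```python
import numpy as np
import pandas as pd

# DataFrame rebuilt by hand from the table shown above.
df = pd.DataFrame({
    "cols": [1, 2, 2, 3, 4, 3, 1, 2],
    "ColA": [1.0, 3.0, 3.0, 2.0, 5.0, 6.0, 5.0, np.nan],
    "ColB": [11, 12, 23, 24, 25, 26, 27, 28],
})

def unique_non_null(x):
    # x is the Series of 'ColA' values belonging to one group of 'cols'.
    # set(x) removes duplicates, pd.isnull(i) filters out missing values.
    return sorted(i for i in set(x) if not pd.isnull(i))

newDF = df.groupby(["cols"]).agg(
    ColA=pd.NamedAgg(column="ColA", aggfunc=unique_non_null)
)
print(newDF)
```

Adding a print(x) inside unique_non_null is a quick way to see exactly which values pandas passes in for each group.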