groupby in pandas API on Spark


I have a pandas DataFrame, created as below:

import pandas as pd

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)

Here df is a plain pandas DataFrame.

I am converting this DataFrame to the pandas API on Spark:

import pyspark.pandas as ps
pdf = ps.from_pandas(df)
print(type(pdf))

Now the DataFrame type is <class 'pyspark.pandas.frame.DataFrame'>. Next, I apply groupby on pdf as below:

for i,j in pdf.groupby("Team"):
    print(i)
    print(j)

I am getting the error below:

KeyError: (0,)

I am not sure whether this functionality works with the pandas API on Spark?

CodePudding user response:

pyspark pandas does not implement all pandas functionality as-is, because Spark has a distributed architecture; operations that depend on walking rows or groups in order, such as row-wise iteration, are therefore not always supported.

If you want to print the groups, then this pyspark pandas code:

pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))

is equivalent to pandas code:

for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
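
Note that the print inside apply runs where each group's data lives, so on a real cluster its output may land in executor logs rather than your driver console. If you specifically need a driver-side loop, one workaround (a sketch, assuming the set of group keys is small enough to collect cheaply) is to pull the distinct keys first and filter per key:

# Driver-side sketch using pdf from above: collect the distinct group
# keys to the driver, then print each filtered sub-frame.
for team in pdf["Team"].unique().to_numpy():
    print(team)
    print(pdf[pdf["Team"] == team])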

Reference to source code

If you scan the source code, you will find that there is no __iter__() implementation in the pyspark pandas GroupBy: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html

whereas the pandas GroupBy iterator yields (group_name, sub_group) pairs: https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816
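
This also suggests why the failure surfaces as KeyError: (0,) rather than a plain TypeError (my reading, not confirmed against the pyspark source): when a class defines __getitem__ but no __iter__, Python falls back to the legacy index-based iteration protocol and calls obj[0], obj[1], and so on; the GroupBy's __getitem__ is meant for column selection, so the integer gets treated as a missing column key. A minimal toy illustration, not the real pyspark code:

# Toy class mimicking a GroupBy that supports column selection via
# __getitem__ but defines no __iter__.
class FakeGroupBy:
    def __getitem__(self, key):
        # Real column lookup would happen here; an integer index is not
        # a valid column label. The tuple wrapping imitates how the real
        # error is reported.
        raise KeyError((key,))

# `for` falls back to FakeGroupBy()[0], [1], ... and the KeyError escapes:
for item in FakeGroupBy():
    print(item)
# KeyError: (0,)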

Documentation reference to iterate groups

pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration

pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups

CodePudding user response:

If you just want to see the groups, use plain pandas, where iterating over the groupby object yields (name, group) tuples:

for i in df.groupby("Team"):
    print(i)

Or convert the pandas-on-Spark DataFrame back to pandas first, since pdf.groupby(...) itself is not iterable (which is exactly why the original loop fails):

for i in pdf.to_pandas().groupby("Team"):
    print(i)
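
If the end goal is a per-group computation rather than printing, the more idiomatic route is to aggregate instead of iterate, which keeps the work distributed, e.g.:

# Aggregation avoids iteration entirely and stays distributed.
print(pdf.groupby("Team")["Points"].mean())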