group by in pandas API on spark


I have the pandas DataFrame below:

import pandas as pd

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017]}
df = pd.DataFrame(data)

Here df is a Pandas dataframe.

I am trying to convert this DataFrame to the pandas API on Spark:

import pyspark.pandas as ps
pdf = ps.from_pandas(df)

Now the DataFrame type is <class 'pyspark.pandas.frame.DataFrame'>. Now I am applying a group by on pdf like below:

for i,j in pdf.groupby("Team"):
    print(i)
    print(j)

I am getting the error below:

KeyError: (0,)

I am not sure whether this functionality works in the pandas API on Spark.

CodePudding user response:

The pandas API on Spark does not implement every pandas feature as-is, because Spark has a distributed architecture. Operations such as row-wise or group-wise iteration may therefore be missing or behave differently.

If you want to print the groups, then this pandas-on-Spark code:

pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))

is equivalent to pandas code:

for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
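If you need an actual loop over the groups rather than apply(), one possible workaround is to collect the distinct keys first and filter the frame once per key. The sketch below is only an illustration: iter_groups is a hypothetical helper, not part of pandas or pyspark.pandas, and it assumes the number of distinct keys is small, that the keys are sortable, and that none of them is missing.

```python
def iter_groups(frame, key):
    """Yield (group_name, sub_frame) pairs, mimicking pandas' groupby iteration.

    Hypothetical helper, not part of pandas or pyspark.pandas. It relies only
    on calls both libraries provide: column selection, drop_duplicates(),
    to_numpy() and boolean-mask filtering.
    """
    # Bring the distinct keys to the driver and sort them for a stable order.
    # Assumption: few distinct keys, all sortable, none missing.
    names = sorted(frame[key].drop_duplicates().to_numpy())
    for name in names:
        # With a pandas-on-Spark DataFrame each filter is evaluated lazily
        # as its own Spark job, so this costs one pass per group.
        yield name, frame[frame[key] == name]
```

With the frames from the question, this would be used as for name, sub_grp in iter_groups(pdf, "Team"): followed by the print calls. Because every group triggers a separate filter, groupby(...).apply(...) is likely the better choice when there are many groups.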

Reference to source code

If you scan the source code, you will find that there is no __iter__() implementation for pyspark pandas: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html

but the iterator yields (group_name, sub_group) for pandas: https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816
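The missing __iter__() probably also explains the exact error in the question. When an object defines __getitem__ but not __iter__, a Python for loop falls back to calling obj[0], obj[1], and so on. If the object treats that argument as a column label, the first call fails with a KeyError for 0. The sketch below reproduces this with a hypothetical ColumnSelector class; that the pandas-on-Spark GroupBy behaves the same way is an assumption based on the error message, not something confirmed here.

```python
class ColumnSelector:
    """Hypothetical stand-in: supports obj[column] but defines no __iter__."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, item):
        # Interpret the argument as a column label; unknown labels raise
        # KeyError carrying the label wrapped in a tuple.
        if item not in self.columns:
            raise KeyError((item,))
        return item


selector = ColumnSelector(["Team", "Rank", "Year"])

caught = None
try:
    # No __iter__ is defined, so the loop calls selector[0] first.
    for value in selector:
        pass
except KeyError as err:
    caught = err

print(caught)  # (0,)
```

pandas avoids this because its GroupBy defines __iter__ explicitly, so the for loop never reaches the __getitem__ fallback.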

Documentation reference to iterate groups

pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration

pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups

CodePudding user response:

If you just want to see the groups, define your PySpark DataFrame correctly and print the results that the generator yields. Or just use pandas:

for i in df.groupby("Team"):
    print(i)

for i in pdf.groupBy("Team"):
    print(i)