I have a pandas DataFrame, shown below:
import pandas as pd

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
Here df is a pandas DataFrame.
I am trying to convert this DataFrame to the pandas API on Spark:
import pyspark.pandas as ps
pdf = ps.from_pandas(df)
print(type(pdf))
Now the DataFrame type is <class 'pyspark.pandas.frame.DataFrame'>. Now I am applying a groupby on pdf, like below:
for i, j in pdf.groupby("Team"):
    print(i)
    print(j)
I am getting the error below:
KeyError: (0,)
Is iterating over groups like this not supported by the pandas API on Spark?
CodePudding user response:
The pandas API on Spark (pyspark.pandas) does not implement every pandas feature as-is, because Spark has a distributed architecture. Operations such as row-wise iteration may therefore behave differently or not be supported at all.
If you want to print the groups, then this pyspark.pandas code:
pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))
is equivalent to pandas code:
for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
Reference to source code
If you scan the source code, you will find that there is no __iter__() implementation for the pyspark.pandas GroupBy: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html
whereas in pandas the iterator yields (group_name, sub_group): https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816
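A likely explanation for the exact error in the question: when an object has no __iter__() but does define __getitem__(), a Python for loop falls back to calling __getitem__(0), __getitem__(1), and so on. The sketch below uses a hypothetical stand-in class (ColumnSelector is not part of pyspark or pandas) whose __getitem__() treats its argument as a column label, and it assumes the pyspark.pandas GroupBy behaves in a similar way.

```python
class ColumnSelector:
    """Hypothetical stand-in: supports obj[label] but defines no __iter__()."""

    def __init__(self, columns):
        self._columns = set(columns)

    def __getitem__(self, label):
        # Normalise the label to a tuple, as a multi-level column key would be.
        key = label if isinstance(label, tuple) else (label,)
        if key[0] not in self._columns:
            raise KeyError(key)
        return f"selected {key[0]}"


selector = ColumnSelector(["Rank", "Year", "Points"])

try:
    # No __iter__(), so Python tries selector[0], selector[1], ...
    for item in selector:
        print(item)
except KeyError as exc:
    message = f"KeyError: {exc}"
    print(message)
```

The loop never reaches a real group: the first implicit lookup uses the integer 0 as a label, which is not a column, so the lookup fails immediately.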
Documentation reference to iterate groups
pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration
pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups
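If you need (group_name, sub_group) pairs without relying on GroupBy iteration, one option is to loop over the distinct keys and filter the frame for each one. The helper below, iter_groups, is a hypothetical name introduced only for this sketch. It is demonstrated on the plain pandas df from the question and assumes that drop_duplicates(), to_numpy(), and boolean filtering behave the same way on a pyspark.pandas DataFrame.

```python
import pandas as pd


def iter_groups(frame, key):
    """Yield (group_name, sub_frame) pairs without using GroupBy.__iter__()."""
    # Collect the distinct key values; on Spark this brings only the keys
    # to the driver, not the full data.
    names = sorted(frame[key].drop_duplicates().to_numpy())
    for name in names:
        # Boolean filtering selects the rows that belong to this group.
        yield name, frame[frame[key] == name]


data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)

groups = dict(iter_groups(df, "Team"))
for name, sub_grp in groups.items():
    print(name)
    print(sub_grp)
```

Each filter is a separate pass over the data, so on Spark this approach is only reasonable when the number of groups is small. Note that 'kings' and 'Kings' are distinct keys and end up in separate groups.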
CodePudding user response:
If you just want to see the groups, iterate over the groupby result and print each item it yields. With plain pandas, each item is a (group_name, sub_group) tuple:
for i in df.groupby("Team"):
    print(i)
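Or, starting from the pandas-on-Spark DataFrame, convert it back to pandas first with to_pandas(). This collects all the data onto the driver, so it only suits small data. (groupBy is the PySpark SQL DataFrame spelling; pandas-on-Spark uses groupby, and iterating it directly raises the KeyError shown in the question.)
for i in pdf.to_pandas().groupby("Team"):
    print(i)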