I have a basic 'for' loop that prints the number of active customers each year. I can print the output, but I want it to be a single table/DataFrame with two columns, year and customer count, where each iteration of the loop creates one row in the table.
from pyspark.sql.functions import col, year

for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
CodePudding user response:
I was able to solve this by creating a blank DataFrame with the desired schema outside the loop and using union, but I'm still curious if there's a shorter solution:
from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("customer_count", IntegerType(), True),
])
df2 = spark.createDataFrame([], schema=schema)

for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)

display(df2)
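One way to shorten this (untested against the original data) is to collect the (year, count) pairs in a plain Python list and call `createDataFrame` once, which avoids both the blank DataFrame and the repeated `union` calls. The sketch below uses a hypothetical `first_sale_years` sample and a stand-in `count_through` helper so the pattern runs without a Spark cluster; in the real job you would keep `df.filter(...).count()` inside the comprehension.

```python
# Sketch of a shorter version, assuming `df`, `spark`, and `schema` from the answer above.
first_sale_years = [2016, 2018, 2018, 2019, 2021, 2022]  # hypothetical sample data

def count_through(yr):
    # Stand-in for: df.filter(year(col('first_sale')) <= yr).count()
    return sum(1 for y in first_sale_years if y <= yr)

# Build all rows first, then create the DataFrame in one call instead of five unions:
rows = [(yr, count_through(yr)) for yr in range(2018, 2023)]
# df2 = spark.createDataFrame(rows, schema)
print(rows)  # [(2018, 3), (2019, 4), (2020, 4), (2021, 5), (2022, 6)]
```

Note that each `count()` is still a separate Spark action, so this is shorter to write but not fundamentally cheaper to run.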
CodePudding user response:
I don't have your data, so I can't test if this works, but how about something like this:
from pyspark.sql.functions import col, year

year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()
The view grp[['year', 'cumsum']]
should match your for-loop's output, at least for the years that actually appear in the data.
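As a sanity check on that equivalence, here is a small pandas-only simulation (with made-up `first_sale` years, since the original data isn't available) comparing the groupby-plus-cumsum result to the question's loop:

```python
import pandas as pd

# Hypothetical sample: the year of each customer's first sale.
df = pd.DataFrame({'year': [2016, 2018, 2018, 2019, 2021, 2022]})

# Cumulative-count approach, mirroring the groupby + cumsum answer:
grp = df.groupby('year').size().reset_index(name='count').sort_values('year')
grp['cumsum'] = grp['count'].cumsum()
cum = dict(zip(grp['year'], grp['cumsum']))

# Loop approach from the question:
loop = {yr: int((df['year'] <= yr).sum()) for yr in range(2018, 2023)}

# For years present in the data, the two agree:
for yr, c in cum.items():
    if yr in loop:
        assert loop[yr] == c
print(cum)  # {2016: 1, 2018: 3, 2019: 4, 2021: 5, 2022: 6}
```

The one difference is that the cumsum table only has rows for years that occur in the data (and includes years outside 2018-2022), so a filter or reindex may be needed to reproduce the loop's exact range.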