I have a basic 'for' loop that prints the number of active customers each year. I can print the output, but I want it to be a single table/DataFrame with two columns, year and customer count, where each iteration of the loop creates one row in the table.
from pyspark.sql.functions import col, year

for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
CodePudding user response:
I was able to solve this by creating a blank DataFrame with the desired schema outside the loop and using union, but I'm still curious if there's a shorter solution:
from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("customer_count", IntegerType(), True),
])
df2 = spark.createDataFrame([], schema=schema)

for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)

display(df2)
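One way to shorten this (untested against the original data) is to collect the (year, count) pairs in a plain Python list and call `createDataFrame` once, which avoids both the blank DataFrame and the repeated `union` calls. The sketch below uses a hypothetical `first_sale_years` sample and a stand-in `count_through` helper so the pattern runs without a Spark cluster; in the real job you would keep `df.filter(...).count()` inside the comprehension.

```python
# Sketch of a shorter version, assuming `df`, `spark`, and `schema` from the answer above.
first_sale_years = [2016, 2018, 2018, 2019, 2021, 2022]  # hypothetical sample data

def count_through(yr):
    # Stand-in for: df.filter(year(col('first_sale')) <= yr).count()
    return sum(1 for y in first_sale_years if y <= yr)

# Build all rows first, then create the DataFrame in one call instead of five unions:
rows = [(yr, count_through(yr)) for yr in range(2018, 2023)]
# df2 = spark.createDataFrame(rows, schema)
print(rows)  # [(2018, 3), (2019, 4), (2020, 4), (2021, 5), (2022, 6)]
```

Note that each `count()` is still a separate Spark action, so this is shorter to write but not fundamentally cheaper to run.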
CodePudding user response:
I don't have your data, so I can't test if this works, but how about something like this:
from pyspark.sql.functions import col, year

year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()
The view grp[['year', 'cumsum']]
should match your for-loop's output, at least for the years that actually appear in the data.
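As a sanity check on that equivalence, here is a small pandas-only simulation (with made-up `first_sale` years, since the original data isn't available) comparing the groupby-plus-cumsum result to the question's loop:

```python
import pandas as pd

# Hypothetical sample: the year of each customer's first sale.
df = pd.DataFrame({'year': [2016, 2018, 2018, 2019, 2021, 2022]})

# Cumulative-count approach, mirroring the groupby + cumsum answer:
grp = df.groupby('year').size().reset_index(name='count').sort_values('year')
grp['cumsum'] = grp['count'].cumsum()
cum = dict(zip(grp['year'], grp['cumsum']))

# Loop approach from the question:
loop = {yr: int((df['year'] <= yr).sum()) for yr in range(2018, 2023)}

# For years present in the data, the two agree:
for yr, c in cum.items():
    if yr in loop:
        assert loop[yr] == c
print(cum)  # {2016: 1, 2018: 3, 2019: 4, 2021: 5, 2022: 6}
```

The one difference is that the cumsum table only has rows for years that occur in the data (and includes years outside 2018-2022), so a filter or reindex may be needed to reproduce the loop's exact range.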