Here's my PySpark DataFrame:
+--------------------------------------------------+----------+
|date                                              |date_count|
+--------------------------------------------------+----------+
|[20210629, 20210629]                              |495       |
|[20210619, 20210619, 20210619]                    |1781      |
|[20210611]                                        |3675263   |
|[20210611, 20210611, 20210611, 20210611, 20210611]|3         |
+--------------------------------------------------+----------+
To give you a clue, it comes from a groupBy aggregation like this:
from pyspark.sql.functions import max as pyspark_max, min as pyspark_min, sum as pyspark_sum, avg, count

# Note: header/inferSchema/delimiter are CSV reader options; the Parquet reader ignores them.
timeseries_monthly = spark.read.options(header='True', inferSchema='True', delimiter=',').parquet("url...")
date = timeseries_monthly.select(timeseries_monthly["gps.date"])
date.groupBy('date').agg(count('date').alias('date_count')).show(4, truncate=False)
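For context, the bracketed values indicate that gps.date is an array column, so groupBy treats each distinct array as its own key, which is why 20210611 lands in two separate rows below. A sketch of the schema this implies (the string element type is an assumption, not verified):

date.printSchema()
# root
#  |-- date: array (nullable = true)
#  |    |-- element: string (containsNull = true)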
Here's my expected output:
+--------+----------+
|date    |date_count|
+--------+----------+
|20210629|495       |
|20210619|1781      |
|20210611|3675263   |
|20210611|3         |
+--------+----------+
CodePudding user response:
Use the array_distinct() and array_join() functions in PySpark: array_distinct() removes the duplicates inside each array, and array_join() converts the resulting single-element array into a plain string.
Example:

from pyspark.sql.functions import array_distinct, array_join, col

df.withColumn("date", array_join(array_distinct(col("date")), '')).show()
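For completeness, here is a self-contained sketch against sample rows copied from the question; the SparkSession setup and the DataFrame name df are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_distinct, array_join, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (["20210629", "20210629"], 495),
        (["20210619", "20210619", "20210619"], 1781),
        (["20210611"], 3675263),
        (["20210611", "20210611", "20210611", "20210611", "20210611"], 3),
    ],
    ["date", "date_count"],
)

# array_distinct() collapses each array's duplicates to one element;
# array_join() then flattens that single-element array into a plain string.
df.withColumn("date", array_join(array_distinct(col("date")), '')).show(truncate=False)

which prints:

+--------+----------+
|date    |date_count|
+--------+----------+
|20210629|495       |
|20210619|1781      |
|20210611|3675263   |
|20210611|3         |
+--------+----------+

One caveat: array_join(..., '') concatenates elements, so it only yields a clean single date here because each array repeats one value. If an array could hold several distinct dates, a non-empty delimiter, or element_at(array_distinct(col("date")), 1), would be safer.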