Home > Software engineering >  PySpark Data Visualization from String Values in Columns
PySpark Data Visualization from String Values in Columns

Time:05-07

I have a table which has the information as shown in the table from a Pyspark dataframeenter image description here

I need to perform a data visualization by plotting the number of completed studies each month in a given year. I am of the opinion that each completed (taken from the status column) will be matched against each of the months of the year, and be aggregated per year.

The challenge I am faced with is how to aggregate each of the completed against the months and subsequently in the year and then plot the data. The values of the data in the column are strings and converting them to integers is another challenge.

I understand that I need to have the status with the value "Completed" and aggregate them as per years to have two columns that would be plotted as x and y.

Any idea on how this can be achieved is appreciated.

CodePudding user response:

If I understood your question correctly, you are looking for something like the following: here you first convert the string values to an actual date column with the to_date function, then you group by such date column and perform a count of completed studies in that month-year combination.

import pyspark.sql.functions as F

new_df = df \
  .filter((F.col('Status') == 'Completed') & (F.col('Completion').isNotNull())) \
  .withColumn('date', F.to_date('Completion', 'MMM yyyy')) \
  .groupby('date') \
  .count() \
  .sort('date')

Example

# this is sample data similar to yours, with only the relevant columns
df = spark.createDataFrame([
  ('Completed', 'Jul 2020'),
  ('Completed', 'Jan 2018'),
  ('Completed', 'Dec 2018'),
  ('Completed', 'Jul 2020'),
  ('Completed', 'Jan 2018'),
  ('Completed', 'Jan 2020'),
  ('Completed', None)
], ('Status', 'Completion'))

new_df = df \
  .filter((F.col('Status') == 'Completed') & (F.col('Completion').isNotNull())) \
  .withColumn('date', F.to_date('Completion', 'MMM yyyy')) \
  .groupby('date') \
  .count() \
  .sort('date')

new_df.show()
 ---------- ----- 
|      date|count|
 ---------- ----- 
|2018-01-01|    2|
|2018-12-01|    1|
|2020-01-01|    1|
|2020-07-01|    2|
 ---------- ----- 
  • Related