PySpark DataFrame - how to add multiple columns to a DataFrame, based on data in 2 columns


I have a PySpark DataFrame requirement that I need inputs on.

Here is the scenario:

df1 schema:

root
  |-- applianceName: string (nullable = true)
  |-- customer: string (nullable = true)
  |-- daysAgo: integer (nullable = true)
  |-- countAnomaliesByDay: long (nullable = true)

Sample Data:
applianceName | customer | daysAgo | countAnomaliesByDay
app1          | cust1    | 0       | 100
app1          | cust1    | 1       | 200
app1          | cust1    | 2       | 300
app1          | cust1    | 3       | 400
app1          | cust1    | 4       | 500
app1          | cust1    | 5       | 600
app1          | cust1    | 6       | 700
To df1, I need to add the columns day0, day1, day2, day3, day4, day5, day6, as shown below:


applianceName | customer | day0 | day1 | day2 | day3 | day4 | day5 | day6
app1          | cust1    | 100  | 200  | 300  | 400  | 500  | 600  | 700

i.e. column day0 will hold countAnomaliesByDay when daysAgo = 0, column day1 will hold countAnomaliesByDay when daysAgo = 1, and so on.

How do I achieve this?

tia!

CodePudding user response:

# Pivot daysAgo into one column per value; max() picks the single
# countAnomaliesByDay for each (applianceName, customer, daysAgo)
df_pivot = df.groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .max('countAnomaliesByDay') \
    .fillna(0)
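
Note that pivot names the new columns after the daysAgo values ('0' through '6'), not day0 ... day6. A minimal sketch of renaming them afterwards (assuming 0-6 are the only daysAgo values present):

# Rename the pivoted columns '0'..'6' to 'day0'..'day6'
for days_ago in range(7):
    df_pivot = df_pivot.withColumnRenamed(str(days_ago), f'day{days_ago}')
df_pivot.show(truncate=False)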

CodePudding user response:

I hope this works for your solution. I use the pivot function of PySpark to perform this:

import findspark

findspark.init()
findspark.find()

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField

# create a Spark session
spark = SparkSession.builder.appName('StackOverflowMultiple').getOrCreate()

# schema matching df1 from the question (daysAgo as an integer)
newDF = [
    StructField('applianceName', StringType(), True),
    StructField('customer', StringType(), True),
    StructField('daysAgo', IntegerType(), True),
    StructField('countAnomaliesByDay', IntegerType(), True)
]
finalStruct = StructType(fields=newDF)

# read the sample data from CSV
df = spark.read.csv('./pyspark_add_multiple_cols.csv',
                    schema=finalStruct,
                    header=True)

# pivot daysAgo into one column per value (0..6); sum() aggregates
# countAnomaliesByDay for each (applianceName, customer, daysAgo)
df_pivot = df \
    .groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .sum('countAnomaliesByDay')
df_pivot.show(truncate=False)
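
If you don't have a CSV file handy, here is a small sketch (assuming the sample values from the question) that builds the same DataFrame in memory instead of reading './pyspark_add_multiple_cols.csv':

# Build the question's sample data in memory instead of reading the CSV
sample_rows = [('app1', 'cust1', d, (d + 1) * 100) for d in range(7)]
df = spark.createDataFrame(sample_rows, schema=finalStruct)

# Same pivot as above; yields columns 0..6 holding 100..700 for app1/cust1
df_pivot = df \
    .groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .sum('countAnomaliesByDay')
df_pivot.show(truncate=False)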