I have a PySpark DataFrame requirement that I need input on.
Here is the scenario:
df1 schema:
root
|-- applianceName: string (nullable = true)
|-- customer: string (nullable = true)
|-- daysAgo: integer (nullable = true)
|-- countAnomaliesByDay: long (nullable = true)
Sample data:
applianceName | customer | daysAgo | countAnomaliesByDay
app1          | cust1    | 0       | 100
app1          | cust1    | 1       | 200
app1          | cust1    | 2       | 300
app1          | cust1    | 3       | 400
app1          | cust1    | 4       | 500
app1          | cust1    | 5       | 600
app1          | cust1    | 6       | 700
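For reference, a minimal snippet that builds this sample df1 (assuming a local SparkSession; the integer columns are inferred as longs, which is close enough for this example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sample').getOrCreate()

# Rows mirror the sample data above: daysAgo 0..6, counts 100..700.
df1 = spark.createDataFrame(
    [('app1', 'cust1', d, (d + 1) * 100) for d in range(7)],
    ['applianceName', 'customer', 'daysAgo', 'countAnomaliesByDay'],
)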
To df1, I need to add the columns day0, day1, day2, day3, day4, day5, and day6, as shown below:
applianceName | customer | day0 | day1 | day2 | day3 | day4 | day5 | day6
app1          | cust1    | 100  | 200  | 300  | 400  | 500  | 600  | 700
That is, column day0 will hold countAnomaliesByDay where daysAgo = 0, column day1 will hold countAnomaliesByDay where daysAgo = 1, and so on.
How do I achieve this?
Thanks in advance!
CodePudding user response:
# Pivot daysAgo into one column per distinct value; max() collapses the
# single countAnomaliesByDay per (applianceName, customer, daysAgo) group.
df_pivot = df.groupby('applianceName', 'customer') \
    .pivot('daysAgo') \
    .max('countAnomaliesByDay') \
    .fillna(0)
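One detail worth noting: the pivoted columns take their names from the daysAgo values themselves ('0' through '6'), not day0 ... day6. A minimal rename sketch, assuming the grouping columns above (the pivot_cols name is just illustrative):
pivot_cols = [c for c in df_pivot.columns
              if c not in ('applianceName', 'customer')]
for c in pivot_cols:
    df_pivot = df_pivot.withColumnRenamed(c, f'day{c}')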
CodePudding user response:
I hope this works for you. I use PySpark's pivot function to do this:
import findspark
findspark.init()  # locate the local Spark installation

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField

# Create a Spark session
spark = SparkSession.builder.appName('StackOverflowMultiple').getOrCreate()

# Schema matching df1 (daysAgo is an integer per the question's schema)
fields = [
    StructField('applianceName', StringType(), True),
    StructField('customer', StringType(), True),
    StructField('daysAgo', IntegerType(), True),
    StructField('countAnomaliesByDay', IntegerType(), True)
]
finalStruct = StructType(fields=fields)

df = spark.read.csv('./pyspark_add_multiple_cols.csv',
                    schema=finalStruct,
                    header=True)
# Pivot the daysAgo values into columns; each (applianceName, customer,
# daysAgo) combination appears once, so sum() just carries the value over.
df_pivot = df \
    .groupBy('applianceName', 'customer') \
    .pivot('daysAgo') \
    .sum('countAnomaliesByDay')
df_pivot.show(truncate=False)
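A possible refinement, assuming the daysAgo values are known up front (0 through 6 here): passing them to pivot() explicitly spares Spark the extra pass it would otherwise make to collect the distinct values, and the resulting columns can then be renamed to the day0 ... day6 form the question asks for. A sketch under that assumption:
days = list(range(7))

# With explicit values, Spark skips the job that computes distinct daysAgo.
df_pivot = df \
    .groupBy('applianceName', 'customer') \
    .pivot('daysAgo', days) \
    .sum('countAnomaliesByDay')

# Pivoted columns are named '0' ... '6'; rename them to day0 ... day6.
for d in days:
    df_pivot = df_pivot.withColumnRenamed(str(d), f'day{d}')

df_pivot.show(truncate=False)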