Suppose I have a dataframe like this:
id,month,price
1,2021-04-30,9
1,2021-01-31,5
1,2021-02-28,6
1,2021-03-31,8
For each id, I want a new column containing the sum of price for the current row's month-1 and month-2. For example, for the row 1,march,8 the output is 5+6=11, since the two months before the March row are January and February. There will be other ids in the main data as well.
CodePudding user response:
Convert the month names into month numbers, then use them for ordering in a Window partitioned by id to get the running sum:
from pyspark.sql import functions as F, Window

df = spark.createDataFrame([
    (1, "apr", 9), (1, "jan", 5),
    (1, "feb", 6), (1, "march", 8)
], ["id", "month", "price"])

# handle both short and full textual representations of month names
month_number = F.when(
    F.length("month") == 3, F.month(F.to_date(F.col("month"), "MMM"))
).otherwise(F.month(F.to_date(F.col("month"), "MMMM")))

# the frame covers rows whose month number is 1 or 2 below the current row's,
# so the current row itself is excluded
w = Window.partitionBy("id").orderBy(month_number).rangeBetween(-2, -1)

df.withColumn("price_sum", F.sum("price").over(w)).show()
#+---+-----+-----+---------+
#| id|month|price|price_sum|
#+---+-----+-----+---------+
#|  1|  jan|    5|     null|
#|  1|  feb|    6|        5|
#|  1|march|    8|       11|
#|  1|  apr|    9|       14|
#+---+-----+-----+---------+
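Note that ordering by the bare month number assumes all rows fall within a single year, as in the sample. If the data spans several years, a year-aware index keeps the two-months-back frame correct; a minimal sketch, assuming a hypothetical proper date column named dt instead of bare month names:

# hypothetical dt column; consecutive calendar months then differ by exactly 1,
# even across a year boundary
month_index = F.year("dt") * 12 + F.month("dt")
w = Window.partitionBy("id").orderBy(month_index).rangeBetween(-2, -1)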
For your updated question, you can truncate the dates to month precision, then use a window whose range is between interval 2 months preceding and interval 1 months preceding:
df = spark.createDataFrame([
    (1, "2021-04-30", 9), (1, "2021-01-31", 5),
    (1, "2021-02-28", 6), (1, "2021-03-31", 8)
], ["id", "month", "price"])

df.withColumn(
    "date",
    F.date_trunc("month", F.col("month"))  # normalize each date to the first of its month
).withColumn(
    "price_sum",
    F.expr("""sum(price) over(partition by id order by date
                              range between interval 2 months preceding
                              and interval 1 months preceding)
           """)
).drop("date").show()
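Run on the sample above, this should give the same sums as the first approach (row order in show() may differ):

#+---+----------+-----+---------+
#| id|     month|price|price_sum|
#+---+----------+-----+---------+
#|  1|2021-01-31|    5|     null|
#|  1|2021-02-28|    6|        5|
#|  1|2021-03-31|    8|       11|
#|  1|2021-04-30|    9|       14|
#+---+----------+-----+---------+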
CodePudding user response:
Use a window function to SUM price, PARTITION BY id, ORDER BY the numeric month, and use a ROWS frame over the two preceding rows (excluding the current row).
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header",True).csv("path/to/file") # Assuming file is csv
df.createOrReplaceTempView('df')
df1 = spark.sql("""
    SELECT id, month, price,
           CASE
               WHEN month = 'jan' THEN 1
               WHEN month = 'feb' THEN 2
               WHEN month = 'mar' THEN 3
               WHEN month = 'apr' THEN 4
               WHEN month = 'may' THEN 5
               WHEN month = 'jun' THEN 6
               WHEN month = 'jul' THEN 7
               WHEN month = 'aug' THEN 8
               WHEN month = 'sep' THEN 9
               WHEN month = 'oct' THEN 10
               WHEN month = 'nov' THEN 11
               ELSE 12
           END AS month_num
    FROM df
""")
df1.createOrReplaceTempView('df1')
spark.sql("""
    SELECT id, month, price,
           SUM(price) OVER (PARTITION BY id ORDER BY month_num
                            ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) AS price_sum
    FROM df1
""").show()
If you want to drop the jan and feb rows, which never have two full preceding months, filter with WHERE month_num NOT IN (1, 2) in an outer query; adding the filter to the second query itself would also remove those rows from the window frame, so march would lose its preceding months. See the sketch below.
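A sketch of the filtered version, wrapping the windowed query so jan and feb still feed the frame for march and apr:

spark.sql("""
    SELECT id, month, price, price_sum
    FROM (
        SELECT id, month, price, month_num,
               SUM(price) OVER (PARTITION BY id ORDER BY month_num
                                ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) AS price_sum
        FROM df1
    ) t
    WHERE month_num NOT IN (1, 2)
""").show()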