Home > database >  Pandas / Pyspark for loop column substract
Pandas / Pyspark for loop column substract

Time:07-14

I'd like to know a way to substract the values vouchers from credit balance in a dataframe.

There is a column "credit" that will try to match the vouchers used : "v1", "v2", ecc.

IS:
enter image description here

To BE:
enter image description here

So the vouchers should be covered starting from the latest to the most recent. From voucher 3 to voucher 1.

The credit column should try to cover the vouchers (from 3 to 1). If credit surpass the vouchers the remaining credit should be stored in the credit column.

I am using python notebooks with pandas and PySpark libraries.

CodePudding user response:

One of the many ways to acheive this would be using pandas.apply.
See if this helps:

import numpy as np
import pandas as pd

data={
  "name":["tom","jim"],
  "cummulative":[17,15],
  "voucher1":[10,0],
  "voucher2":[5,5],
  "voucher3":[2,10],
  "credit":[20,10]
}
df=pd.DataFrame(data)


def change_order(row):
  new_dict=row.to_dict()
  credit=row.credit
  cummulative=row.cummulative
  for i in range(3,0,-1):
    current=row[f"voucher{i}"]
    if credit>=current:
      credit-=current
      cummulative-=current
      new_dict["credit"]=credit
      new_dict["cummulative"]=cummulative
      new_dict[f"voucher{i}"]=0
  series=pd.Series(new_dict)
  return series
df=df.apply(change_order,axis=1)
print(df)

Output:

  name  cummulative  voucher1  voucher2  voucher3  credit
0  tom            0         0         0         0       3
1  jim            5         0         5         0       0

CodePudding user response:

It's actually a very nice exercise.

This is what could be created in PySpark. It feels though, there should be a nicer way...

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('tom', 17, 10, 5, 2, 20),
     ('jim', 15, 0, 5, 10, 10)],
    ['name', 'cumulative_vouchers_used', 'voucher1', 'voucher2', 'voucher3', 'credit'])

Script:

c, v1, v2, v3 = F.col('credit'), F.col('voucher1'), F.col('voucher2'), F.col('voucher3')

subt_v3 = F.when(c >= v3, c - v3).otherwise(c)
new_v3 = F.when(c >= v3, 0).otherwise(v3)
subt_v2 = F.when((subt_v3 >= v2) & (subt_v3 != c), subt_v3 - v2).otherwise(subt_v3)
new_v2 = F.when(subt_v3 >= v2, 0).otherwise(v2)
subt_v1 = F.when((subt_v2 >= v1) & (subt_v2 != subt_v3), subt_v2 - v1).otherwise(subt_v2)
new_v1 = F.when(subt_v2 >= v1, 0).otherwise(v1)
new_cum = new_v1   new_v2   new_v3

df = df.select(
    'name',
    new_cum.alias('cumulative_vouchers_used'),
    new_v1.alias('voucher1'),
    new_v2.alias('voucher2'),
    new_v3.alias('voucher3'),
    subt_v1.alias('credit')
)

df.show()
#  ---- ------------------------ -------- -------- -------- ------ 
# |name|cumulative_vouchers_used|voucher1|voucher2|voucher3|credit|
#  ---- ------------------------ -------- -------- -------- ------ 
# | tom|                       0|       0|       0|       0|     3|
# | jim|                       5|       0|       5|       0|     0|
#  ---- ------------------------ -------- -------- -------- ------ 
  • Related