Home > other >  Pythonic way to split column values by multiple column conditions
Pythonic way to split column values by multiple column conditions

Time:11-05

We have online order data that has total shipping charges at the order level, but our accountants need to split the total shipping charge across several vendors, which is broken out at the line level.

Not every product has shipping and some products have promotional free shipping, which needs to be accounted for when splitting shipping charges (don't account for shipping charges when items aren't shipped or shipping is free).

I've created this test and it produces the expected outcome (big props to another coworker for getting something working), but want to understand if there is a more efficient (pythonic!) way to accomplish this.

This was previously done via odbc connection to a sql database and handled with an excel formula.

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [10, 11, 11, 11, 12, 12, 13, 14, 15, 15],
                   'shipping': [5, 5, 5, 5, 5, 5, 0, 0, 5, 5],
                   'shipstatus': [True, True, True, False,
                   True, True, False, True, True, True],
                   'freeship': [False, True, False, False,
                   False, False, False, True, False, False]})

df['a'] = df.groupby(['id','shipstatus','freeship'])['shipping'].transform('count')
# the final step of the excel code is counting (grouping) by id and shipstatus, 
# so we group those here. we also group by freeship so that the count of id/shipstatus
# won't be included when freeship is true (which we zero out later)

df['b'] = df['a'] * (df['freeship']==False) 
# if freeship is true, second piece evaluates to false, whole thing evaluates to zero

df['c'] = df['shipping']/df['b'] 
# this will give you inf where we set stuff to zero above. 
# you'll also get NaN when 'shipping' is zero

df['LineShipping'] = df['shipstatus'] * (df['freeship']==False) * df['c']
# sets the whole thing to zero if freeship is true or shipstatus is false, 
# otherwise multiplies our # previous result by 1 and so no change

df = df.fillna(0) 
# sets all the NaN to zero

df = df.drop(columns=['a','b','c']) 
# saves the dataframe but with the temp columns dropped

print(df)

Expected output

CodePudding user response:

So basically you want to distribute the shipping within an id evenly between rows with shipstatus==True and freeship==False. When shipstatus==False or freeship==True, LineShipping==0 always.

Therefore, you can just count/divide where your condition holds. That way, you don't get warning from division by zeros:

counts = (df[df['shipstatus'] & ~df['freeship']]      # only count when shipstatus == True and freeship == False
             .groupby(['id'])                         # no need to groupby shipstatus
             ['shipping'].transform('size')           # size or count
          )

# only divide where `shipstatus==True` and `freeship==False`, else is `NaN`
# then fillna with 0
df['LineShipping'] = df['shipping'].div(counts).fillna(0)

Output:

   id  shipping  shipstatus  freeship  LineShipping
0  10         5        True     False           5.0
1  11         5        True      True           0.0
2  11         5        True     False           5.0
3  11         5       False     False           0.0
4  12         5        True     False           2.5
5  12         5        True     False           2.5
6  13         0       False     False           0.0
7  14         0        True      True           0.0
8  15         5        True     False           2.5
9  15         5        True     False           2.5
  • Related