Home > Mobile >  SUMIF equivalent with unique date ranges in Python (Summing if date falls within various date ranges
SUMIF equivalent with unique date ranges in Python (Summing if date falls within various date ranges

Time:11-02

I am looking to create variables that sum based on date ranges unique to different features / categories to automate a current Excel task in Python. It is like a SUMIF in Excel but unique date ranges for different variables. I`ll try to recreate a similar situation as I cannot share the exact data. At the moment, I have a sales dataframe with sales per week by area like so:

Week Area Sales
08/02/2019 London 200
08/02/2019 Scotland 150
15/02/2019 London 100
15/02/2019 Scotland 120
22/02/2019 London 50
22/02/2019 Scotland 20

I want to incorporate whether the date falls within sales periods for products, so say I have another dataframe like this:

Product Sale Start Week Sale End Week
Boots 08/02/2019 15/02/2019
Accessories 15/02/2019 22/02/2019

I want to create something that sums if the dates fall within those specified for each product. For example, for Boots below, sum Sales if the weeks in Sales fall within the Sales Periods date range:

Area Boots Accessories
London 300 150
Scotland 270 140

I`ve tried groupby and a pivot table but I am not sure how to incorporate the sales dates filters into it. At the moment, the sales period dataframe and the sales dataframe are separate.

This is what I have for the pivot code which is almost there:

test = pd.pivot_table(df,index=['Area','Week'],columns=sales_period_df['Product'],values=['Sales'],aggfunc=np.sum)

But this doesnt include filtering for the sales periods and I`m not sure how to incorporate this. Would appreciate your advice, thanks in advance!

CodePudding user response:

# DF: sales (top DF in question)
# DF2: sales period (second DF in question)

# format the date into datetime
df['Week'] = pd.to_datetime(df['Week'], dayfirst=True)
df2[['Sale Start Week','Sale End Week']]=df2[['Sale Start Week','Sale End Week']].apply(pd.to_datetime, dayfirst=True)
df2

# merge using merge_asof 
df3=pd.merge_asof( df.sort_values('Week'),
                  df2.sort_values('Sale Start Week'),
                  left_on = 'Week',
                  right_on='Sale Start Week')

# including only when week falls within end week
df3=df3.loc[df3['Week'] <= df3['Sale End Week']]
df3

# cross tab for resultset
out= (pd.crosstab(index=df3['Area'], 
                  columns=df3['Product'], 
                  values=df3['Sales'], 
                  aggfunc='sum')
      .reset_index()
      .rename_axis(columns=None))

out
    Area    Accessories     Boots
0   London      150         200
1   Scotland    140         150

CodePudding user response:

Due to overlapping periods, we can't use the classic pivoting in this case (unless we duplicate overlapping sales records for each period, wich seems too much). So we have to create this table manually.

To start, let's prepare some data to work with:

import pandas as pd
from io import StringIO

data = '''Week,Area,Sales
08/02/2019,London,200
08/02/2019,Scotland,150
15/02/2019,London,100
15/02/2019,Scotland,120
22/02/2019,London,50
22/02/2019,Scotland,20'''

df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()

data = '''Product,Sale Start Week,Sale End Week
Boots,08/02/2019,15/02/2019
Accessories,15/02/2019,22/02/2019
Something,08/02/2019,22/02/2019'''

sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)

The structure of df and sales_period_df is slightly modified so that Week and Product are now indexes.

Next, we prepare the output frame and supportive data:

import pandas.IndexSlice as idx

# create slices from sales_period_df
# which can be used to locate data in df
periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)

# separate sales by area
sales_by_area = df.groupby('Area')['Sales']

# create the output DataFrame with unique areas as indexes 
# and products as columns
output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)

To fill in the data, we can use eather apply or agg like this:

for product in output.columns:
    output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())

Let's assemble the code:

import pandas as pd
from pandas import IndexSlice as idx
from io import StringIO

data = '''Week,Area,Sales
08/02/2019,London,200
08/02/2019,Scotland,150
15/02/2019,London,100
15/02/2019,Scotland,120
22/02/2019,London,50
22/02/2019,Scotland,20'''

df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()

data = '''Product,Sale Start Week,Sale End Week
Boots,08/02/2019,15/02/2019
Accessories,15/02/2019,22/02/2019
Something,08/02/2019,22/02/2019'''

sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)

periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)
output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)
sales_by_area = df.groupby('Area')['Sales']

for product in output.columns:
    output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())

print(output)

Output:

Product   Boots  Accessories  Something
London      300          150        350
Scotland    270          140        290
  • Related