SUMIF equivalent with unique date ranges in Python (Summing if date falls within various date ranges-CodePudding

I am looking to create variables that sum based on date ranges unique to different features / categories to automate a current Excel task in Python. It is like a SUMIF in Excel but unique date ranges for different variables. I`ll try to recreate a similar situation as I cannot share the exact data. At the moment, I have a sales dataframe with sales per week by area like so:

Week	Area	Sales
08/02/2019	London	200
08/02/2019	Scotland	150
15/02/2019	London	100
15/02/2019	Scotland	120
22/02/2019	London	50
22/02/2019	Scotland	20

I want to incorporate whether the date falls within sales periods for products, so say I have another dataframe like this:

Product	Sale Start Week	Sale End Week
Boots	08/02/2019	15/02/2019
Accessories	15/02/2019	22/02/2019

I want to create something that sums if the dates fall within those specified for each product. For example, for Boots below, sum Sales if the weeks in Sales fall within the Sales Periods date range:

Area	Boots	Accessories
London	300	150
Scotland	270	140

I`ve tried groupby and a pivot table but I am not sure how to incorporate the sales dates filters into it. At the moment, the sales period dataframe and the sales dataframe are separate.

This is what I have for the pivot code which is almost there:

test = pd.pivot_table(df,index=['Area','Week'],columns=sales_period_df['Product'],values=['Sales'],aggfunc=np.sum)

But this doesnt include filtering for the sales periods and I`m not sure how to incorporate this. Would appreciate your advice, thanks in advance!

CodePudding user response：

# DF: sales (top DF in question)
# DF2: sales period (second DF in question)

# format the date into datetime
df['Week'] = pd.to_datetime(df['Week'], dayfirst=True)
df2[['Sale Start Week','Sale End Week']]=df2[['Sale Start Week','Sale End Week']].apply(pd.to_datetime, dayfirst=True)
df2

# merge using merge_asof 
df3=pd.merge_asof( df.sort_values('Week'),
                  df2.sort_values('Sale Start Week'),
                  left_on = 'Week',
                  right_on='Sale Start Week')

# including only when week falls within end week
df3=df3.loc[df3['Week'] <= df3['Sale End Week']]
df3

# cross tab for resultset
out= (pd.crosstab(index=df3['Area'], 
                  columns=df3['Product'], 
                  values=df3['Sales'], 
                  aggfunc='sum')
      .reset_index()
      .rename_axis(columns=None))

out

    Area    Accessories     Boots
0   London      150         200
1   Scotland    140         150

CodePudding user response：

Due to overlapping periods, we can't use the classic pivoting in this case (unless we duplicate overlapping sales records for each period, wich seems too much). So we have to create this table manually.

To start, let's prepare some data to work with:

import pandas as pd
from io import StringIO

data = '''Week,Area,Sales
08/02/2019,London,200
08/02/2019,Scotland,150
15/02/2019,London,100
15/02/2019,Scotland,120
22/02/2019,London,50
22/02/2019,Scotland,20'''

df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()

data = '''Product,Sale Start Week,Sale End Week
Boots,08/02/2019,15/02/2019
Accessories,15/02/2019,22/02/2019
Something,08/02/2019,22/02/2019'''

sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)

The structure of df and sales_period_df is slightly modified so that Week and Product are now indexes.

Next, we prepare the output frame and supportive data:

import pandas.IndexSlice as idx

# create slices from sales_period_df
# which can be used to locate data in df
periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)

# separate sales by area
sales_by_area = df.groupby('Area')['Sales']

# create the output DataFrame with unique areas as indexes 
# and products as columns
output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)

To fill in the data, we can use eather apply or agg like this:

for product in output.columns:
    output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())

Let's assemble the code:

import pandas as pd
from pandas import IndexSlice as idx
from io import StringIO

data = '''Week,Area,Sales
08/02/2019,London,200
08/02/2019,Scotland,150
15/02/2019,London,100
15/02/2019,Scotland,120
22/02/2019,London,50
22/02/2019,Scotland,20'''

df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()

data = '''Product,Sale Start Week,Sale End Week
Boots,08/02/2019,15/02/2019
Accessories,15/02/2019,22/02/2019
Something,08/02/2019,22/02/2019'''

sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)

periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)
output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)
sales_by_area = df.groupby('Area')['Sales']

for product in output.columns:
    output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())

print(output)

Output:

Product   Boots  Accessories  Something
London      300          150        350
Scotland    270          140        290