Home > Software design >  Add missing rows for each Client - Python / Pandas
Add missing rows for each Client - Python / Pandas

Time:04-01

I have df with Weeks, Months and Years.

week = ['01/03/2022 - 01/09/2022', '01/10/2022 - 01/16/2022', '01/17/2022 - 01/23/2022', '01/24/2022 - 01/30/2022']
month = ["January", "January", "January", "January"]
year = [2022, 2022, 2022, 2022]

myDict = {}

myDict["Week"] = week
myDict["Month"] = month 
myDict["Year"] = year

dates_df = pd.DataFrame(data=myDict)
dates_df
         Week              Month    Year

01/03/2022 - 01/09/2022   January   2022
01/10/2022 - 01/16/2022   January   2022
01/17/2022 - 01/23/2022   January   2022
01/24/2022 - 01/30/2022   January   2022

Data looks like that (Previously Grouped):

test_data = {'CLient Id': [1,1,1,1,2,2,2,3,3],
    'Client Name': ['Tom Holland', 'Tom Holland', 'Tom Holland', 'Tom Holland', 'Brad Pitt', 'Brad Pitt', 'Brad Pitt', 'Anna Delvey', 'Anna Delvey'],
    'City': ['New York', 'New York', 'New York', 'New York', 'Los Angeles', 'Los Angeles', 'Los Angeles', 'New York', 'New York'],    
    'Week': ['01/03/2022 - 01/09/2022', '01/10/2022 - 01/16/2022', '01/17/2022 - 01/23/2022', '01/24/2022 - 01/30/2022',
            '01/03/2022 - 01/09/2022', '01/10/2022 - 01/16/2022', '01/24/2022 - 01/30/2022', '01/03/2022 - 01/09/2022',
            '01/24/2022 - 01/30/2022'], 
    'Month': ['January', 'January', 'January', 'January', 'January', 'January', 'January', 'January', 'January'], 
    'Year': [2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022],
    'Spent': [1000, 750, 1200, 850, 777, 1500, 1200, 1500, 1200]}

test_df = pd.DataFrame(data=test_data)
test_df
Client Id    Client Name          City                Week               Month    Year   Spent

1            Tom Holland       New York      01/03/2022 - 01/09/2022    January   2022   1000
1            Tom Holland       New York      01/10/2022 - 01/16/2022    January   2022   750
1            Tom Holland       New York      01/17/2022 - 01/23/2022    January   2022   1200
1            Tom Holland       New York      01/24/2022 - 01/30/2022    January   2022   850
2            Brad Pitt         Los Angeles   01/03/2022 - 01/09/2022    January   2022   777
2            Brad Pitt         Los Angeles   01/10/2022 - 01/16/2022    January   2022   1500
2            Brad Pitt         Los Angeles   01/24/2022 - 01/30/2022    January   2022   1200
3            Anna Delvey       New York      01/03/2022 - 01/09/2022    January   2022   1500
3            Anna Delvey       New York      01/24/2022 - 01/30/2022    January   2022   1200

I need to create additional rows of Client Name, City, Week, Month and Year for each Client (Id) which missing Week, from the date df Spent column in this case should be 0.

Output I need:

Client Id    Client Name          City                  Week               Month    Year   Spent

1            Tom Holland        New York       01/03/2022 - 01/09/2022    January   2022   1000
1            Tom Holland        New York       01/10/2022 - 01/16/2022    January   2022   750
1            Tom Holland        New York       01/17/2022 - 01/23/2022    January   2022   1200
1            Tom Holland        New York       01/24/2022 - 01/30/2022    January   2022   850
2            Brad Pitt          Los Angeles    01/03/2022 - 01/09/2022    January   2022   777
2            Brad Pitt          Los Angeles    01/10/2022 - 01/16/2022    January   2022   1500
2            Brad Pitt          Los Angeles    01/17/2022 - 01/23/2022    January   2022   0
2            Brad Pitt          Los Angeles    01/24/2022 - 01/30/2022    January   2022   1200
3            Anna Delvey        New York       01/03/2022 - 01/09/2022    January   2022   1500
3            Anna Delvey        New York       01/10/2022 - 01/16/2022    January   2022   0
3            Anna Delvey        New York       01/17/2022 - 01/23/2022    January   2022   0
3            Anna Delvey        New York       01/24/2022 - 01/30/2022    January   2022   1200

I was trying to code it, but got no luck.

CodePudding user response:

You could pivot reindex fillna (to get the missing data) stack (to get back to the previous shape):

columns = ['Week','Month','Year']
out = (test_df.pivot(['CLient Id', 'Client Name', 'City'], columns, ['Spent'])
       .reindex(pd.MultiIndex.from_arrays(dates_df.assign(Spent='Spent').to_numpy()[:, [-1,0,1,2]].T, 
                                          names=[None] columns), axis=1)
       .fillna(0).stack(level=columns).reset_index())

Output:

    CLient Id  Client Name         City                     Week    Month  Year   Spent  
0           1  Tom Holland     New York  01/03/2022 - 01/09/2022  January  2022  1000.0  
1           1  Tom Holland     New York  01/10/2022 - 01/16/2022  January  2022   750.0  
2           1  Tom Holland     New York  01/17/2022 - 01/23/2022  January  2022  1200.0  
3           1  Tom Holland     New York  01/24/2022 - 01/30/2022  January  2022   850.0  
4           2    Brad Pitt  Los Angeles  01/03/2022 - 01/09/2022  January  2022   777.0  
5           2    Brad Pitt  Los Angeles  01/10/2022 - 01/16/2022  January  2022  1500.0  
6           2    Brad Pitt  Los Angeles  01/17/2022 - 01/23/2022  January  2022     0.0  
7           2    Brad Pitt  Los Angeles  01/24/2022 - 01/30/2022  January  2022  1200.0  
8           3  Anna Delvey     New York  01/03/2022 - 01/09/2022  January  2022  1500.0  
9           3  Anna Delvey     New York  01/10/2022 - 01/16/2022  January  2022     0.0  
10          3  Anna Delvey     New York  01/17/2022 - 01/23/2022  January  2022     0.0  
11          3  Anna Delvey     New York  01/24/2022 - 01/30/2022  January  2022  1200.0  

If you have more than one columns to fill with 0, you could use:

columns = ['Week','Month','Year']
value_columns = ['Spent', ...]
new_df = (pd.DataFrame({'key':[1]*len(value_columns),'New':value_columns})
          .merge(dates_df.assign(key=1)).drop(columns='key'))
out = (test_df.pivot(['CLient Id', 'Client Name', 'City'], columns, value_columns)
       .reindex(pd.MultiIndex.from_arrays(zip(*new_df.to_numpy()), 
                                          names=[None]   columns), axis=1)
       .fillna(0).stack(level=columns).reset_index())
  • Related