How can I add rows to a dataframe with zero value?-CodePudding

I have a dataframe with product sales to different accounts. Some of our products we haven't sold to those accounts, but I need it included in the dataframe with a 0 value. Here is what I have:

account	degree_type	discipline	dollars
Harvard	4 Year	Accounting	1000
Yale	4 Year	Biology	2000
Ivy Tech	2 Year	Biology	500
Harvard	4 Year	Precalculus	3000

Let's assume we've never sold Biology to Harvard. I need a dataframe that looks like this which includes Harvard Biology at 0:

account	degree_type	discipline	dollars
Harvard	4 Year	Accounting	1000
Yale	4 Year	Biology	2000
Ivy Tech	2 Year	Biology	500
Harvard	4 Year	Precalculus	3000
Harvard	4 Year	Biology	0

The actual dataset is much bigger, of course, and contains about 3,000 schools and 149 different disciplines. So I need to find all the 0's and then append the values to the dataframe.

Anyone have any thoughts on how I could create these rows without writing over the sales I do have?

CodePudding user response：

You can append

out = df.append(pd.Series(['Harvard','4 Year','Biology',0],index=df.columns),ignore_index=True)
Out[49]: 
    account degree_type   discipline  dollars
0   Harvard      4 Year   Accounting     1000
1      Yale      4 Year      Biology     2000
2  Ivy Tech      2 Year      Biology      500
3   Harvard      4 Year  Precalculus     3000
4   Harvard      4 Year      Biology        0

CodePudding user response：

If I understood it correctly, you want to consider all the distinct combinations resulting from the cartesian product of the columns 'account', 'degree_type' and 'discipline'.

In that case, you can do

import pandas as pd

df = pd.DataFrame({'account': {0: 'Harvard', 1: 'Yale', 2: 'Ivy Tech', 3: 'Harvard'},
 'degree_type': {0: '4 Year', 1: '4 Year', 2: '2 Year', 3: '4 Year'},
 'discipline': {0: 'Accounting', 1: 'Biology', 2: 'Biology', 3: 'Precalculus'},
 'dollars': {0: 1000, 1: 2000, 2: 500, 3: 3000}})

# get the unique values of each column 
idx_levels = {col: df[col].unique() for col in ['account', 'degree_type', 'discipline']}

# construct a new index from the cartesian product of those columns 
new_idx = pd.MultiIndex.from_product(idx_levels.values(), names=idx_levels.keys())

# set those columns as the index and reindex the DataFrame to include all possible combinations
# set the 'dollars' of combinations not present in the original DataFrame to zero 
res = df.set_index(list(idx_levels)).reindex(new_idx, fill_value=0).reset_index()

>>> df

    account degree_type   discipline  dollars
0   Harvard      4 Year   Accounting     1000
1      Yale      4 Year      Biology     2000
2  Ivy Tech      2 Year      Biology      500
3   Harvard      4 Year  Precalculus     3000

>>> res 

     account degree_type   discipline  dollars
0    Harvard      4 Year   Accounting     1000
1    Harvard      4 Year      Biology        0
2    Harvard      4 Year  Precalculus     3000
3    Harvard      2 Year   Accounting        0
4    Harvard      2 Year      Biology        0
5    Harvard      2 Year  Precalculus        0
6       Yale      4 Year   Accounting        0
7       Yale      4 Year      Biology     2000
8       Yale      4 Year  Precalculus        0
9       Yale      2 Year   Accounting        0
10      Yale      2 Year      Biology        0
11      Yale      2 Year  Precalculus        0
12  Ivy Tech      4 Year   Accounting        0
13  Ivy Tech      4 Year      Biology        0
14  Ivy Tech      4 Year  Precalculus        0
15  Ivy Tech      2 Year   Accounting        0
16  Ivy Tech      2 Year      Biology      500
17  Ivy Tech      2 Year  Precalculus        0