I have a dataframe with product sales to different accounts. Some of our products we haven't sold to those accounts, but I need it included in the dataframe with a 0 value. Here is what I have:
account | degree_type | discipline | dollars |
---|---|---|---|
Harvard | 4 Year | Accounting | 1000 |
Yale | 4 Year | Biology | 2000 |
Ivy Tech | 2 Year | Biology | 500 |
Harvard | 4 Year | Precalculus | 3000 |
Let's assume we've never sold Biology to Harvard. I need a dataframe that looks like this which includes Harvard Biology at 0:
account | degree_type | discipline | dollars |
---|---|---|---|
Harvard | 4 Year | Accounting | 1000 |
Yale | 4 Year | Biology | 2000 |
Ivy Tech | 2 Year | Biology | 500 |
Harvard | 4 Year | Precalculus | 3000 |
Harvard | 4 Year | Biology | 0 |
The actual dataset is much bigger, of course, and contains about 3,000 schools and 149 different disciplines. So I need to find all the 0's and then append the values to the dataframe.
Anyone have any thoughts on how I could create these rows without writing over the sales I do have?
CodePudding user response:
You can append
out = df.append(pd.Series(['Harvard','4 Year','Biology',0],index=df.columns),ignore_index=True)
Out[49]:
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Yale 4 Year Biology 2000
2 Ivy Tech 2 Year Biology 500
3 Harvard 4 Year Precalculus 3000
4 Harvard 4 Year Biology 0
CodePudding user response:
If I understood it correctly, you want to consider all the distinct combinations resulting from the cartesian product of the columns 'account', 'degree_type' and 'discipline'.
In that case, you can do
import pandas as pd
df = pd.DataFrame({'account': {0: 'Harvard', 1: 'Yale', 2: 'Ivy Tech', 3: 'Harvard'},
'degree_type': {0: '4 Year', 1: '4 Year', 2: '2 Year', 3: '4 Year'},
'discipline': {0: 'Accounting', 1: 'Biology', 2: 'Biology', 3: 'Precalculus'},
'dollars': {0: 1000, 1: 2000, 2: 500, 3: 3000}})
# get the unique values of each column
idx_levels = {col: df[col].unique() for col in ['account', 'degree_type', 'discipline']}
# construct a new index from the cartesian product of those columns
new_idx = pd.MultiIndex.from_product(idx_levels.values(), names=idx_levels.keys())
# set those columns as the index and reindex the DataFrame to include all possible combinations
# set the 'dollars' of combinations not present in the original DataFrame to zero
res = df.set_index(list(idx_levels)).reindex(new_idx, fill_value=0).reset_index()
>>> df
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Yale 4 Year Biology 2000
2 Ivy Tech 2 Year Biology 500
3 Harvard 4 Year Precalculus 3000
>>> res
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Harvard 4 Year Biology 0
2 Harvard 4 Year Precalculus 3000
3 Harvard 2 Year Accounting 0
4 Harvard 2 Year Biology 0
5 Harvard 2 Year Precalculus 0
6 Yale 4 Year Accounting 0
7 Yale 4 Year Biology 2000
8 Yale 4 Year Precalculus 0
9 Yale 2 Year Accounting 0
10 Yale 2 Year Biology 0
11 Yale 2 Year Precalculus 0
12 Ivy Tech 4 Year Accounting 0
13 Ivy Tech 4 Year Biology 0
14 Ivy Tech 4 Year Precalculus 0
15 Ivy Tech 2 Year Accounting 0
16 Ivy Tech 2 Year Biology 500
17 Ivy Tech 2 Year Precalculus 0