Home > OS >  How to insert missing rows in multi-index
How to insert missing rows in multi-index

Time:01-27

I have a dataframe that looks something like this

df = pd.DataFrame({'year':[23,23,23,23,23,23], 'month':[1,1,1,2,3,3], 'utility':['A','A','B','A','A','B'], 'state':['NY','NJ','CA','NJ','NY','CA']})

   year  month utility state
0    23      1       A    NY
1    23      1       A    NJ
2    23      1       B    CA
3    23      2       A    NJ
4    23      3       A    NY
5    23      3       B    CA

And I would like to create new rows for utilities-state combinations with missing months. So the new dataframe would look something like this

   year  month utility state
0    23      1       A    NY
1    23      1       A    NJ
2    23      1       B    CA
3    23      2       A    NY
4    23      2       A    NJ
5    23      2       B    CA
6    23      3       A    NY
7    23      3       A    NJ
8    23      3       B    CA

I know that I can use a MultiIndex and then reindex, but using the from_product() method results in utility-state combinations not present in the original df (I do not want a utility A - CA combination, for example).

I thought about concatenating the utility and state columns and then getting the cartesian product from that, but I think there must be a simpler method.

CodePudding user response:

A possible solution:

cols = ['utility', 'state']
d1 = df.drop_duplicates(cols)
d2 = df.drop_duplicates(['year', 'month'])

d2.assign(**{x: [d1[x].to_list()] * len(d2) for x in cols}).explode(cols)

Output:

   year  month utility state
0    23      1       A    NY
0    23      1       A    NJ
0    23      1       B    CA
3    23      2       A    NY
3    23      2       A    NJ
3    23      2       B    CA
4    23      3       A    NY
4    23      3       A    NJ
4    23      3       B    CA

I was wondering whether a solution using numpy broadcasting would be possible, and it is:

cols1, cols2 = ['year', 'month'], ['utility', 'state']

(pd.DataFrame(
    np.vstack(np.concatenate(
        np.broadcast_arrays(
            df[cols1].drop_duplicates(cols1).values[:,None], 
            df[cols2].drop_duplicates(cols2).values), axis=2)),
    columns=df.columns))

CodePudding user response:

One option is with DataFrame.complete from . For your data, you are basically doing a combination of (year, month) and (utility, state):

# pip install pyjanitor
import janitor 
    ​
df.complete(('year', 'month'), ('utility', 'state'))
   year  month utility state
0    23      1       A    NY
1    23      1       A    NJ
2    23      1       B    CA
3    23      2       A    NY
4    23      2       A    NJ
5    23      2       B    CA
6    23      3       A    NY
7    23      3       A    NJ
8    23      3       B    CA

@Timeless, undelete your code and I'll delete mine. You had a good start, and I edited your code to make it simpler.

  • Related