Home > Software engineering >  merge pandas dataframe with itself to add new rows (like a cross join)
merge pandas dataframe with itself to add new rows (like a cross join)

Time:11-14

I would like to join my dataframe with itself in a way that it has the same amount of rows for a particular column. It sounds a bit complicated but I believe it is not when you see it. So here is an example:

year brand series model version value value 2
2022 bmw A 1X plan 3 1
2022 bmw B 2X plan 8 1
2022 bmw A 1X sold 1 1
2022 bmw C 3X sold 10 1
2021 bmw A 1X sold 50 20
2021 bmw C 3X sold 50 20
2022 bmw A 1X prediction 2 1
2022 audi D 4X prediction 7 1

I would like to have the column version the same amount of model always, like this:

year brand series model version value value 2
2022 bmw A 1X plan 3 1
2022 bmw B 2X plan 8 1
2022 bmw C 3X plan Nan Nan
2022 audi D 4X plan Nan Nan
2022 bmw A 1X sold 1 1
2022 bmw C 3X sold 10 1
2022 bmw A 1X sold 1 1
2021 bmw C 3X sold 50 20
2021 bmw A 1X sold 50 20
2022 audi D 4X sold Nan Nan
2022 bmw A 1X prediction 2 1
2022 audi D 4X prediction 7 1
2022 bmw B 2X prediction Nan Nan
2022 bmw C 3X prediction Nan Nan

As you can see, columns year to version are unique (I have more grouped unique columns). The rest are values(i.e. numeric data).

CodePudding user response:

One option is with complete from pyjanitor, to expose the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

df.complete('version', 'model') 
   model     version  value
0     1X        plan    3.0
1     2X        plan    8.0
2     3X        plan    NaN
3     4X        plan    NaN
4     1X        sold    1.0
5     2X        sold    NaN
6     3X        sold   10.0
7     4X        sold    NaN
8     1X  prediction    2.0
9     2X  prediction    NaN
10    3X  prediction    NaN
11    4X  prediction    7.0

CodePudding user response:

You can pivot, then unstack:

out = (df.pivot(index='model', columns='version', values='value')
         .unstack().reset_index(name='value')
         [df.columns]
      )

Output:

   model     version  value
0     1X        plan    3.0
1     2X        plan    8.0
2     3X        plan    NaN
3     4X        plan    NaN
4     1X  prediction    2.0
5     2X  prediction    NaN
6     3X  prediction    NaN
7     4X  prediction    7.0
8     1X        sold    1.0
9     2X        sold    NaN
10    3X        sold   10.0
11    4X        sold    NaN

A more cumbersome approach (IMO), using a MultiIndex:

idx = pd.MultiIndex.from_product([df['version'].unique(), df['model'].unique()],
                                 names=['version', 'model'])

out = (df.set_index(['version', 'model'])
         .reindex(idx).reset_index()[df.columns]
       )

Output:

   model     version  value
0     1X        plan    3.0
1     2X        plan    8.0
2     3X        plan    NaN
3     4X        plan    NaN
4     1X        sold    1.0
5     2X        sold    NaN
6     3X        sold   10.0
7     4X        sold    NaN
8     1X  prediction    2.0
9     2X  prediction    NaN
10    3X  prediction    NaN
11    4X  prediction    7.0

CodePudding user response:

Another possible solution:

(df.set_index(['model', 'version'])
 .reindex(pd.MultiIndex.from_product(
     [df.model.unique(), df.version.unique()], names=['model', 'version']))
 .reset_index()
 .sort_values(['version', 'model'])
)

Output:

   model     version  value
0     1X        plan    3.0
3     2X        plan    8.0
6     3X        plan    NaN
9     4X        plan    NaN
2     1X  prediction    2.0
5     2X  prediction    NaN
8     3X  prediction    NaN
11    4X  prediction    7.0
1     1X        sold    1.0
4     2X        sold    NaN
7     3X        sold   10.0
10    4X        sold    NaN
  • Related