I would like to join my dataframe with itself in a way that it has the same amount of rows for a particular column. It sounds a bit complicated but I believe it is not when you see it. So here is an example:
year | brand | series | model | version | value | value 2 |
---|---|---|---|---|---|---|
2022 | bmw | A | 1X | plan | 3 | 1 |
2022 | bmw | B | 2X | plan | 8 | 1 |
2022 | bmw | A | 1X | sold | 1 | 1 |
2022 | bmw | C | 3X | sold | 10 | 1 |
2021 | bmw | A | 1X | sold | 50 | 20 |
2021 | bmw | C | 3X | sold | 50 | 20 |
2022 | bmw | A | 1X | prediction | 2 | 1 |
2022 | audi | D | 4X | prediction | 7 | 1 |
I would like to have the column version
the same amount of model
always, like this:
year | brand | series | model | version | value | value 2 |
---|---|---|---|---|---|---|
2022 | bmw | A | 1X | plan | 3 | 1 |
2022 | bmw | B | 2X | plan | 8 | 1 |
2022 | bmw | C | 3X | plan | Nan | Nan |
2022 | audi | D | 4X | plan | Nan | Nan |
2022 | bmw | A | 1X | sold | 1 | 1 |
2022 | bmw | C | 3X | sold | 10 | 1 |
2022 | bmw | A | 1X | sold | 1 | 1 |
2021 | bmw | C | 3X | sold | 50 | 20 |
2021 | bmw | A | 1X | sold | 50 | 20 |
2022 | audi | D | 4X | sold | Nan | Nan |
2022 | bmw | A | 1X | prediction | 2 | 1 |
2022 | audi | D | 4X | prediction | 7 | 1 |
2022 | bmw | B | 2X | prediction | Nan | Nan |
2022 | bmw | C | 3X | prediction | Nan | Nan |
As you can see, columns year
to version
are unique (I have more grouped unique columns). The rest are values(i.e. numeric data).
CodePudding user response:
One option is with complete from pyjanitor, to expose the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete('version', 'model')
model version value
0 1X plan 3.0
1 2X plan 8.0
2 3X plan NaN
3 4X plan NaN
4 1X sold 1.0
5 2X sold NaN
6 3X sold 10.0
7 4X sold NaN
8 1X prediction 2.0
9 2X prediction NaN
10 3X prediction NaN
11 4X prediction 7.0
CodePudding user response:
You can pivot
, then unstack
:
out = (df.pivot(index='model', columns='version', values='value')
.unstack().reset_index(name='value')
[df.columns]
)
Output:
model version value
0 1X plan 3.0
1 2X plan 8.0
2 3X plan NaN
3 4X plan NaN
4 1X prediction 2.0
5 2X prediction NaN
6 3X prediction NaN
7 4X prediction 7.0
8 1X sold 1.0
9 2X sold NaN
10 3X sold 10.0
11 4X sold NaN
A more cumbersome approach (IMO), using a MultiIndex:
idx = pd.MultiIndex.from_product([df['version'].unique(), df['model'].unique()],
names=['version', 'model'])
out = (df.set_index(['version', 'model'])
.reindex(idx).reset_index()[df.columns]
)
Output:
model version value
0 1X plan 3.0
1 2X plan 8.0
2 3X plan NaN
3 4X plan NaN
4 1X sold 1.0
5 2X sold NaN
6 3X sold 10.0
7 4X sold NaN
8 1X prediction 2.0
9 2X prediction NaN
10 3X prediction NaN
11 4X prediction 7.0
CodePudding user response:
Another possible solution:
(df.set_index(['model', 'version'])
.reindex(pd.MultiIndex.from_product(
[df.model.unique(), df.version.unique()], names=['model', 'version']))
.reset_index()
.sort_values(['version', 'model'])
)
Output:
model version value
0 1X plan 3.0
3 2X plan 8.0
6 3X plan NaN
9 4X plan NaN
2 1X prediction 2.0
5 2X prediction NaN
8 3X prediction NaN
11 4X prediction 7.0
1 1X sold 1.0
4 2X sold NaN
7 3X sold 10.0
10 4X sold NaN