I have a dataframe like below:
ID TYPE SN Notes
0 01 Lorem Ipsum
1 02 apple aa11 Dummy text
2 02 banana ab12 Dummy text
3 03 orange ad04 Random text
4 04 Latin words
5 05 apple ac03 Randomised words
6 05 banana ac04 Randomised words
7 05 orange aa41 Randomised words
8 05 cherry af12 Randomised words
9 06 apple aa32 Dolorem Ipsum
There are rows with same ID and same values (e.g. Notes
column) except TYPE
and SN
columns, which are sometimes empty and sometimes not.
I want to change my existing dataframe to group these rows into one row like below:
ID TYPE_1 TYPE_2 TYPE_3 TYPE_4 SN_1 SN_2 SN_3 SN_4 Count Notes
0 01 0 Lorem Ipsum
1 02 apple banana aa11 ab12 2 Dummy text
2 03 orange ad04 1 Random text
3 04 0 Latin words
4 05 apple banana orange cherry ac03 ac04 aa41 af12 4 Randomised words
5 06 apple aa32 1 Dolorem Ipsum
I should group the dataframe by ID
, but then what? There can be any number of rows with same ID in different dataframes so I can't know maximum number of rows with same ID
and create these columns beforehand. How can I achieve this?
CodePudding user response:
Assuming the empty cells are NaN, you can use:
g = df.groupby('ID')
out = (df
.assign(col=g.cumcount().add(1).astype(str),
Count=g['TYPE'].transform('count')
)
.pivot(index=['ID', 'Notes', 'Count'], columns='col')
.pipe(lambda d: d.set_axis(d.columns.map('_'.join), axis=1))
.reset_index()
)
Output:
ID Notes Count TYPE_1 TYPE_2 TYPE_3 TYPE_4 SN_1 SN_2 SN_3 SN_4
0 1 Lorem Ipsum 0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 Dummy text 2 apple banana NaN NaN aa11 ab12 NaN NaN
2 3 Random text 1 orange NaN NaN NaN ad04 NaN NaN NaN
3 4 Latin words 0 NaN NaN NaN NaN NaN NaN NaN NaN
4 5 Randomised words 4 apple banana orange cherry ac03 ac04 aa41 af12
5 6 Dolorem Ipsum 1 apple NaN NaN NaN aa32 NaN NaN NaN
CodePudding user response:
You can aggregate by '|' and then use str.split to get multiple columns and then concat the final result together:
out = (
df.groupby("ID")
.agg(
TYPE=("TYPE", "|".join),
SN=("SN", "|".join),
Notes=("Notes", "first"),
Count=("ID", "count"),
)
.reset_index()
)
types = (
out["TYPE"]
.str.split("|", expand=True)
.fillna("")
.rename(lambda x: f"TYPE_{x 1}", axis=1)
)
sns = (
out["SN"]
.str.split("|", expand=True)
.fillna("")
.rename(lambda x: f"SN_{x 1}", axis=1)
)
out = pd.concat([out.drop(["TYPE", "SN"], axis=1), types, sns], axis=1)
print(out):
ID Notes Count TYPE_1 TYPE_2 TYPE_3 TYPE_4 SN_1 SN_2 \
0 01 Lorem Ipsum 1
1 02 Dummy text 2 apple banana aa11 ab12
2 03 Random text 1 orange ad04
3 04 Latin words 1
4 05 Randomised words 4 apple banana orange cherry ac03 ac04
5 06 Dolorem Ipsum 1 apple aa32
SN_3 SN_4
0
1
2
3
4 aa41 af12
5