I have A Data like this, or you can see my Notebook here : link or the raw file here : link
Id | Type | Label | Value | Value2 |
---|---|---|---|---|
1 | A | Introduction | This Project will be created By Mr.X | |
1 | A | Capacity | 100MB | |
1 | A | Speed | 10Km/h | |
1 | A | Weight | 10kg | |
2 | A | Introduction | This-Project-will-be-created-By-Mr.A | |
2 | A | Capacity | 100MB | |
2 | A | Speed | 5km/h | |
2 | A | Weight | 1kg | |
3 | B | Introduction | This Project will be created By Mr.C | |
3 | B | Capacity | 100MB | |
3 | B | Speed | 5km/h | |
3 | B | Weight | 1kg | |
4 | B | Introduction | This Project will be created By Mr.D | |
4 | B | Capacity | 100MB | |
4 | B | Speed | 5km/h | |
4 | B | Weight | 1kg | |
4 | B | Height | 1m | |
4 | B | Color | red |
You can see that Type A has label value in Value
column but the type B has label value in Value2
Column. I want to grouping for each ID and transposing a Label
value to be columns like this.
Id | PJ | Capacity | Speed | Weight |
---|---|---|---|---|
1 | Mr.X | 100MB | 10Km/h | 10kg |
2 | Mr.A | 100MB | 5Km/h | 1kg |
3 | Mr.C | 100MB | 5Km/h | 1kg |
Where PJ
Column is from the Value of Introduction
But we only get the People Name, and my data also have -
symbol for several value.
I'm a beginner Using Python and I didn't know how to do. Because I think It's hard if I cleaning the data using excel because there is a lot of data. Thank you
CodePudding user response:
First
fillna Value
and use pivot
(df
.assign(Value=df['Value'].fillna(df['Value2']))
.pivot('Id', 'Label', 'Value'))
output:
Label Capacity Introduction Speed Weight
Id
1 100MB This Project will be created By Mr.X 10Km/h 10kg
2 100MB This-Project-will-be-created-By-Mr.A 5km/h 1kg
3 100MB This Project will be created By Mr.C 5km/h 1kg
Next
make PJ
column and reset_index
(full code including First)
(df
.assign(Value=df['Value'].fillna(df['Value2']))
.pivot('Id', 'Label', 'Value')
.assign(PJ=lambda x: x['Introduction'].str.split('By[ -]').str[1])
.iloc[:, [-1, 0, 2, 3]].reset_index())
output:
Label Id PJ Capacity Speed Weight
0 1 Mr.X 100MB 10Km/h 10kg
1 2 Mr.A 100MB 5km/h 1kg
2 3 Mr.C 100MB 5km/h 1kg
CodePudding user response:
You should melt
to first reshape the "Value..." columns, then pivot
using the new "value" column:
(df.melt(['Id', 'Type', 'Label'])
.dropna(subset=['value'])
.pivot(index=['Id', 'Type'], columns='Label', values='value')
.rename_axis(columns=None)
.dropna(axis=1) # remove incomplete columns
.reset_index()
)
Output:
Id Type Capacity Introduction Speed Weight
0 1 A 100MB This Project will be created By Mr.X 10Km/h 10kg
1 2 A 100MB This-Project-will-be-created-By-Mr.A 5km/h 1kg
2 3 B 100MB This Project will be created By Mr.C 5km/h 1kg