How to Grouping, Selecting Several Value to be a columns And Select Certain Word in Python-CodePudding

I have A Data like this, or you can see my Notebook here : link or the raw file here : link

Id	Type	Label	Value	Value2
1	A	Introduction	This Project will be created By Mr.X
1	A	Capacity	100MB
1	A	Speed	10Km/h
1	A	Weight	10kg
2	A	Introduction	This-Project-will-be-created-By-Mr.A
2	A	Capacity	100MB
2	A	Speed	5km/h
2	A	Weight	1kg
3	B	Introduction		This Project will be created By Mr.C
3	B	Capacity		100MB
3	B	Speed		5km/h
3	B	Weight		1kg
4	B	Introduction		This Project will be created By Mr.D
4	B	Capacity		100MB
4	B	Speed		5km/h
4	B	Weight		1kg
4	B	Height		1m
4	B	Color		red

You can see that Type A has label value in Value column but the type B has label value in Value2 Column. I want to grouping for each ID and transposing a Label value to be columns like this.

Id	PJ	Capacity	Speed	Weight
1	Mr.X	100MB	10Km/h	10kg
2	Mr.A	100MB	5Km/h	1kg
3	Mr.C	100MB	5Km/h	1kg

Where PJ Column is from the Value of Introduction But we only get the People Name, and my data also have - symbol for several value.

I'm a beginner Using Python and I didn't know how to do. Because I think It's hard if I cleaning the data using excel because there is a lot of data. Thank you

CodePudding user response：

First

fillna Value and use pivot

(df
 .assign(Value=df['Value'].fillna(df['Value2']))
 .pivot('Id', 'Label', 'Value'))

output:

Label   Capacity    Introduction                            Speed   Weight
Id              
1       100MB       This Project will be created By Mr.X    10Km/h  10kg
2       100MB       This-Project-will-be-created-By-Mr.A    5km/h   1kg
3       100MB       This Project will be created By Mr.C    5km/h   1kg

Next

make PJ column and reset_index (full code including First)

(df
 .assign(Value=df['Value'].fillna(df['Value2']))
 .pivot('Id', 'Label', 'Value')
 .assign(PJ=lambda x: x['Introduction'].str.split('By[ -]').str[1])
 .iloc[:, [-1, 0, 2, 3]].reset_index())

output:

Label   Id  PJ      Capacity    Speed   Weight
0       1   Mr.X    100MB       10Km/h  10kg
1       2   Mr.A    100MB       5km/h   1kg
2       3   Mr.C    100MB       5km/h   1kg

CodePudding user response：

You should melt to first reshape the "Value..." columns, then pivot using the new "value" column:

(df.melt(['Id', 'Type', 'Label'])
   .dropna(subset=['value'])
   .pivot(index=['Id', 'Type'], columns='Label', values='value')
   .rename_axis(columns=None)
   .dropna(axis=1) # remove incomplete columns
   .reset_index()
 )

Output:

   Id Type Capacity                          Introduction   Speed Weight
0   1    A    100MB  This Project will be created By Mr.X  10Km/h   10kg
1   2    A    100MB  This-Project-will-be-created-By-Mr.A   5km/h    1kg
2   3    B    100MB  This Project will be created By Mr.C   5km/h    1kg