I'm trying to work with an Excel file that was put together in a rather annoying format (I did not create it; it's an existing resource I'm using). The values of interest are in a column called (something like) All_Values, separated by periods, while the measures corresponding to those values are specified in a separate column, All_Measures, also separated by periods and in a different order for each row. For example, using a toy dataset:
Object  All_Measures        All_Values  (additional columns that are not like this)
1       Height.Weight       20.50       ...
2       Weight.Height       65.30       ...
3       Height.Width.Depth  22.30.10    ...
What I want to do is reformat the data like this, filling in the missing values with 0s (the final order of the columns isn't important):
Object  Height  Weight  Width  Depth  (additional columns)
1       20      50      0      0      ...
2       30      65      0      0      ...
3       22      0       30     10     ...
One way I can do this is to (very slowly, as it's a large dataset) create a new blank dataframe and then iterate over each row in the existing one, creating a new one-row dataframe whose columns come from splitting All_Measures on '.' and whose values come from splitting All_Values on '.'. Then I remove All_Measures and All_Values from the original row, append the new one-row dataframe to the end of it, and append the result to the blank dataframe. But this is pretty clumsy, and it'd be nice if there were a faster and more elegant way to do it.
Since there's no error here, I don't have an MWE, but here's some code one could copy to create a toy dataset like the above in case it comes in handy.
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)
CodePudding user response:
Use str.split, explode, and pivot_table:
# split the "All" columns into lists
df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

# explode the lists into rows (multi-column explode needs pandas 1.3+)
df = df.explode(['All_Measures', 'All_Values'])

# the split values are still strings; convert them so pivot_table can aggregate them
df['All_Values'] = df['All_Values'].astype(int)

# pivot the measures into columns
df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)
Output:
All_Measures        Depth  Height  Weight  Width
Object Object_Name
1      First            0      20      50      0
2      Second           0      30      65      0
3      Third           10      22       0     30
Detailed breakdown
str.split the "All" columns into lists:
df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')
#    Object            All_Measures    All_Values Object_Name
# 0       1        [Height, Weight]      [20, 50]       First
# 1       2        [Weight, Height]      [65, 30]      Second
# 2       3  [Height, Width, Depth]  [22, 30, 10]       Third

explode the lists into rows (the values are still strings at this point, so convert them to integers):
df = df.explode(['All_Measures', 'All_Values'])
df['All_Values'] = df['All_Values'].astype(int)
#    Object All_Measures  All_Values Object_Name
# 0       1       Height          20       First
# 0       1       Weight          50       First
# 1       2       Weight          65      Second
# 1       2       Height          30      Second
# 2       3       Height          22       Third
# 2       3        Width          30       Third
# 2       3        Depth          10       Third

pivot_table the measures into columns:
df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)
# All_Measures        Depth  Height  Weight  Width
# Object Object_Name
# 1      First            0      20      50      0
# 2      Second           0      30      65      0
# 3      Third           10      22       0     30
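The question mentions additional columns beyond Object_Name. One way to carry them through, sketched below, is to treat every column other than the two split columns as an identifier and put all of them in the pivot index. The helper name widen and its parameters are made up for illustration. It assumes pandas 1.3 or later (for multi-column explode), that each row has the same number of measures and values, that no measure repeats within a row, and that the identifier columns contain no missing values (pivot_table drops rows whose index keys are NaN).

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)


def widen(frame, measures='All_Measures', values='All_Values'):
    """Hypothetical helper: spread period-separated measures into columns."""
    # every column that is not being split is kept as an identifier
    id_cols = [c for c in frame.columns if c not in (measures, values)]

    # split both columns into lists, then explode them in lockstep
    long = frame.assign(**{
        measures: frame[measures].str.split('.'),
        values: frame[values].str.split('.'),
    }).explode([measures, values])

    # the pieces are strings after the split; make them numeric
    long[values] = pd.to_numeric(long[values])

    # one row per identifier combination, one column per measure
    wide = long.pivot_table(
        index=id_cols,
        columns=measures,
        values=values,
        aggfunc='first',
        fill_value=0,
    )
    wide.columns.name = None  # drop the leftover "All_Measures" label
    return wide.reset_index()


result = widen(df)
print(result)
```

Using aggfunc='first' avoids averaging, which only matters if a measure were ever duplicated within one row. Depending on the pandas version, the filled columns may come back as floats rather than integers.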
CodePudding user response:
There's probably some way to accomplish this without using loops or apply(), but I can't think of it. Here's what comes to mind:
import pandas as pd
df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

def parse_combined_measure(row):
    # pair each measure name with its value for this row
    keys = row["All_Measures"].split(".")
    values = row["All_Values"].split(".")
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    return pd.concat([row, pd.Series(dict(zip(keys, values)))])

# rows with different measures are aligned by name; gaps become NaN, then 0
df2 = df.apply(parse_combined_measure, axis=1)
df2 = df2.fillna(0)
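For what it's worth, here is one possible way to get the same result without apply(): build one dict per row with a plain list comprehension and let the DataFrame constructor align the keys into columns. This is only a sketch on the toy data; the names records, expanded, and result are illustrative, and it assumes every value parses as an integer.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# one {measure: value} dict per row, pairing names and values by position
records = [
    dict(zip(measures.split('.'), map(int, values.split('.'))))
    for measures, values in zip(df['All_Measures'], df['All_Values'])
]

# the constructor takes the union of keys as columns; absent measures become NaN
expanded = pd.DataFrame(records, index=df.index).fillna(0).astype(int)

# swap the two packed columns for the expanded ones
result = df.drop(columns=['All_Measures', 'All_Values']).join(expanded)
print(result)
```

The loop here runs over plain Python strings rather than pandas rows, which tends to be much cheaper than building a Series per row.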
CodePudding user response:
# Create a new DataFrame with just the values extracted from the All_Values column
In [24]: new_df = df['All_Values'].str.split('.').apply(pd.Series)
Out[24]:
    0   1    2
0  20  50  NaN
1  65  30  NaN
2  22  30   10
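# Figure out the names those columns should have.
# Note: this takes the names from the row with the most measures and applies them
# by position, so it is only correct when every row lists its measures in that same order.
# In the toy data rows 0 and 1 do not (row 0's 50 is a Weight, not a Width).
In [37]: df.loc[df['All_Measures'].str.count(r'\.').idxmax(), 'All_Measures']
Out[37]: 'Height.Width.Depth'
In [38]: new_df.columns = df.loc[df['All_Measures'].str.count(r'\.').idxmax(), 'All_Measures'].split('.')
In [39]: new_df
Out[39]:
  Height Width Depth
0     20    50   NaN
1     65    30   NaN
2     22    30    10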
# Join the new DF with the original, except the columns we've expanded.
In [41]: df[['Object', 'Object_Name']].join(new_df)
Out[41]:
   Object Object_Name Height Width Depth
0       1       First     20    50   NaN
1       2      Second     65    30   NaN
2       3       Third     22    30    10
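Naming the columns by position only works if every row lists its measures in the same order, which is not the case in the question's data. A possible variation that keeps the split-into-columns idea but labels each value with the measure from its own row is sketched below. It uses str.split(..., expand=True) in place of .apply(pd.Series); the names measures, values, long, wide, and result are illustrative, and it assumes integer values and no repeated measure within a row.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# expand=True gives one column per position; stack() reshapes that into a
# long Series indexed by (row label, position), and dropna() discards the
# padding that shorter rows received
measures = df['All_Measures'].str.split('.', expand=True).stack().dropna()
values = df['All_Values'].str.split('.', expand=True).stack().dropna().astype(int)

# both Series share the same (row, position) index, so they line up pairwise
long = pd.DataFrame({'measure': measures, 'value': values})

# drop the position level, key each value by (row, measure), then unstack
wide = (
    long.droplevel(1)
        .set_index('measure', append=True)['value']
        .unstack(fill_value=0)
)

# join back onto the columns that were not expanded
result = df[['Object', 'Object_Name']].join(wide)
print(result)
```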