How to split data by delimiter for corresponding name and value columns-CodePudding

I'm trying to use an excel file to do something which was put together in a rather annoying format (I did not create it; it's an existing resource I'm using). The values of interest are in a column called (something like) All_Values separated by periods, while the measures corresponding to those values are specified in a separate column, All_Measures, also separated by periods and different for each row. For example, using a toy dataset:

Object        All_Measures  All_Values     (additional columns that are not like this)
     1       Height.Weight      20.50      ...
     2       Weight.Height      65.30      ...
     3  Height.Width.Depth   22.30.10      ...

What I want to do is reformat the data like this, filling in the missing values with 0s (the final order of the columns isn't important):

Object  Height  Weight  Width  Depth  (additional columns)
     1      20      50      0      0  ...
     2      30      65      0      0  ...
     3      22       0     30     10  ...

One way I can do this is to (very slowly, as it's a large dataset) create a new blank dataframe, and then iterate over each row in the existing one, create a new dataframe row with the columns specified by splitting All_Measures by ., and the values specified by splitting All_Values by .. Then, I remove All_Measures and All_Values from the row and append the new dataframe to the end of it, and append that to the blank dataframe. But this is pretty clumsy and it'd be nice if there were a faster and more elegant way to do it.

Since there's no error here, I don't have a MWE, but here's some code one could copy to create a toy dataset like the above in case it comes in handy.

df = pd.DataFrame(
    columns = ['Object','All_Measures','All_Values','Object_Name']
    [[1,'Height.Weight','20.50','First'], 
     [2,'Weight.Height','65.30','Second'], 
     [3,'Height.Width.Depth','22.30.10','Third']]
)

CodePudding user response：

Use str.split, explode, and pivot_table:

# split the "All" columns into lists
df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

# explode the lists into rows
df = df.explode(['All_Measures', 'All_Values'])

# pivot the measures into columns
df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)

Output:

All_Measures       Depth Height Weight Width
Object Object_Name                          
1      First           0     20     50     0
2      Second          0     30     65     0
3      Third          10     22      0    30

Detailed breakdown

str.split the "All" columns into lists:

df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

#    Object            All_Measures    All_Values Object_Name
# 0       1        [Height, Weight]      [20, 50]       First
# 1       2        [Weight, Height]      [65, 30]      Second
# 2       3  [Height, Width, Depth]  [22, 30, 10]       Third

explode the lists into rows:

df = df.explode(['All_Measures', 'All_Values'])

#    Object All_Measures All_Values Object_Name
# 0       1       Height         20       First
# 0       1       Weight         50       First
# 1       2       Weight         65      Second
# 1       2       Height         30      Second
# 2       3       Height         22       Third
# 2       3        Width         30       Third
# 2       3        Depth         10       Third

pivot_table the measures into columns:

df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)

# All_Measures       Depth Height Weight Width
# Object Object_Name                          
# 1      First           0     20     50     0
# 2      Second          0     30     65     0
# 3      Third          10     22      0    30

CodePudding user response：

There's probably some way to accomplish this without using loops or apply(), but I can't think of it. Here's what comes to mind:

import pandas as pd
df = pd.DataFrame(
    [[1,'Height.Weight','20.50','First'], 
     [2,'Weight.Height','65.30','Second'], 
     [3,'Height.Width.Depth','22.30.10','Third']],
    columns = ['Object','All_Measures','All_Values','Object_Name'],
)

def parse_combined_measure(row):
    keys = row["All_Measures"].split(".")
    values = row["All_Values"].split(".")
    return row.append(pd.Series(dict(zip(keys, values))))

df2 = df.apply(parse_combined_measure, axis=1)
df2 = df2.fillna(0)

CodePudding user response：

# Create a new DataFrame with just the values extracted from the All_Values column
In [24]: new_df = df['All_Values'].str.split('.').apply(pd.Series)
Out[24]:
    0   1    2
0  20  50  NaN
1  65  30  NaN
2  22  30   10

# Figure out the names those columns should have
In [37]: df.loc[df['All_Measures'].str.count('\.').idxmax(), 'All_Measures']
Out[37]: 'Height.Width.Depth'

In [38]: new_df.columns = df.loc[df['All_Measures'].str.count('\.').idxmax(), 'All_Measures'].split('.')
Out[39]:
  Height Width Depth
0     20    50   NaN
1     65    30   NaN
2     22    30    10

# Join the new DF with the original, except the columns we've expanded.
In [41]: df[['Object', 'Object_Name']].join(new_df)
Out[41]:
   Object Object_Name Height Width Depth
0       1       First     20    50   NaN
1       2      Second     65    30   NaN
2       3       Third     22    30    10