Split string in data frame depending on units and assign content to specific coloumns-CodePudding

With an original csv file, I import following df

import pandas as pd  
import numpy as np

# assign data of lists.  
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C']}     
# Create DataFrame  
df = pd.DataFrame(data)       
# Print the output.  
print(df)

My objective is to split the content of the cell INTERVAL into specific columns depending on their units just like this:

# Objective 
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C'],'INTERVAL_A': ['100', '100', '100',np.nan],'INTERVAL_A': ['100', '100', '100',np.nan],'INTERVAL_B': [np.nan, '20', np.nan, np.nan],'INTERVAL_C': [np.nan, np.nan, '3','5']}     
# Create DataFrame  
df = pd.DataFrame(data)       
# Print the output.  
print(df)

I am able to split the interval column and assign the content to additional coloumns, this approach fails when there are different arrangement of the of INTERVAL columns, refer to the last line of following code snippet.

# Split Interval
A0 = df['INTERVAL'].str.split(pat="or",expand=True, n=-1)
df['INTERVAL_X1'] = A0.loc[:,0] # Assign
df['INTERVAL_X2'] = A0.loc[:,1]
df['INTERVAL_X3'] = A0.loc[:,2]

So what is a good approach to loop over the content of the INTERVAL_X coloumns and to reassign the content depending on the unit? An addition question would be how to isolate just the value since the label of the column contains information about the unit?

Thank you all in advance

Expected output:

               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100        NaN          3
3                   5 C        NaN        NaN          5

CodePudding user response：

You can use a regex to match numbers followed by space and a capital letter and str.extractall. Then reshape your data and finally join to the original dataframe:

df2 = (df['INTERVAL'].str.extractall('(?P<INTERVAL>\d ) (?P<ID>[A-Z])')
      .droplevel(1)
      .set_index('ID', append=True)
      .unstack('ID')
      )

df2.columns = df2.columns.map('_'.join)

df.join(df2)

output:

               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100         20          3
3                   5 C        NaN        NaN          5

fine tuning

If you have longer identifier (e.g., A/AB/GHI), use: '(?P<INTERVAL>\d ) (?P<ID>[A-Z] )'.

If you have optional or multiple spaces: '(?P<INTERVAL>\d )\s*(?P<ID>[A-Z] )'

CodePudding user response：

Inspired by @mozway's answer:

df.join(
    df['INTERVAL']                                     # Select column to extract info from
    .str.extractall('(?P<INTERVAL>\d ) (?P<ID>[A-Z])') # Extract INTERVAL and ID as different columns
    .pivot(columns="ID")                               # Use values of ID column as columns
    .droplevel(0, axis=1)                              # Drop original column name from columns' levels
    .sum(level=0)                                      # Collapse values to ID col
)