Starting from a CSV file, I import the following DataFrame:
import pandas as pd
import numpy as np
# assign data of lists.
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
My objective is to split the content of each INTERVAL cell into separate columns according to its unit, like this:
# Objective
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C'],'INTERVAL_A': ['100', '100', '100',np.nan],'INTERVAL_B': [np.nan, '20', np.nan, np.nan],'INTERVAL_C': [np.nan, np.nan, '3','5']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
I am able to split the INTERVAL column and assign the pieces to additional columns, but this approach fails when the units appear in a different arrangement, because the pieces are assigned by position rather than by unit. See the last line of the following code snippet.
# Split Interval
A0 = df['INTERVAL'].str.split(pat="or",expand=True, n=-1)
df['INTERVAL_X1'] = A0.loc[:,0] # Assign
df['INTERVAL_X2'] = A0.loc[:,1]
df['INTERVAL_X3'] = A0.loc[:,2]
So what is a good approach to loop over the content of the INTERVAL_X columns and reassign the content depending on the unit? An additional question: how can I isolate just the value, since the column label already contains the unit?
Thank you all in advance.
Expected output:
               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100        NaN          3
3                   5 C        NaN        NaN          5
CodePudding user response:
You can use a regex that matches a number followed by a space and a capital letter with str.extractall, then reshape the result and finally join it to the original dataframe:
df2 = (df['INTERVAL'].str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
.droplevel(1)
.set_index('ID', append=True)
.unstack('ID')
)
df2.columns = df2.columns.map('_'.join)
df.join(df2)
output:
               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100         20          3
3                   5 C        NaN        NaN          5
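To make the reshape step easier to follow, here is a minimal sketch that prints the intermediate result of str.extractall before it is unstacked. It assumes the sample data from the question; the names matches, wide and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'INTERVAL': ['100 A', '100 A or 20 B',
                                '100 A or 20 B or 3 C', '5 C']})

# extractall returns one row per match; its index has two levels:
# the original row label and the match number within that row.
matches = df['INTERVAL'].str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
print(matches)

# Drop the match number, move ID into the index, then spread ID over columns.
wide = matches.droplevel(1).set_index('ID', append=True).unstack('ID')

# Flatten the two-level column labels, e.g. ('INTERVAL', 'A') -> 'INTERVAL_A'.
wide.columns = wide.columns.map('_'.join)

result = df.join(wide)
print(result)
```

Because the values are keyed by the ID group rather than by position, a row such as '5 C' ends up in INTERVAL_C no matter where it appears in the string.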
Fine-tuning
If you have longer identifiers (e.g., A/AB/GHI), use: r'(?P<INTERVAL>\d+) (?P<ID>[A-Z]+)'.
If the space is optional or there can be several spaces: r'(?P<INTERVAL>\d+)\s*(?P<ID>[A-Z]+)'
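As a quick check of the more permissive pattern, here is a small sketch on made-up strings (the sample values below are not from the question) that mix multi-letter identifiers with missing or repeated spaces:

```python
import pandas as pd

# Hypothetical sample: no space, double space, and a three-letter unit.
s = pd.Series(['100 A', '100A or 20  AB', '3 GHI'])

pattern = r'(?P<INTERVAL>\d+)\s*(?P<ID>[A-Z]+)'

out = (s.str.extractall(pattern)
        .droplevel(1)
        .set_index('ID', append=True)
        .unstack('ID'))
out.columns = out.columns.map('_'.join)
print(out)
```

The lowercase "or" separator is never captured, since the ID group only accepts capital letters and every match has to start with digits.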
CodePudding user response:
Inspired by @mozway's answer:
df.join(
df['INTERVAL'] # Select column to extract info from
.str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])') # Extract INTERVAL and ID as different columns
.pivot(columns="ID") # Use values of ID column as columns
.droplevel(0, axis=1) # Drop original column name from columns' levels
.groupby(level=0).first() # Collapse the match rows into one row per original index (sum(level=0) was removed in pandas 2.0)
)
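On the additional question about isolating just the value: the extracted values are strings. If numbers are wanted, one possible sketch (an assumption about what is needed, not part of the answers above) is to convert the new columns with pd.to_numeric after reshaping:

```python
import pandas as pd

df = pd.DataFrame({'INTERVAL': ['100 A', '100 A or 20 B',
                                '100 A or 20 B or 3 C', '5 C']})

wide = (df['INTERVAL']
        .str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
        .droplevel(1)                          # drop the match number
        .set_index('ID', append=True)['INTERVAL']
        .unstack('ID')                         # one column per unit: A, B, C
        .add_prefix('INTERVAL_')               # A -> INTERVAL_A, ...
        .apply(pd.to_numeric))                 # strings -> numbers, NaN kept

result = df.join(wide)
print(result)
```

The unit then lives only in the column name and the cells hold plain numbers, with NaN where a unit is absent from the row.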