I am trying to split a column and store the part after the last "_" as a new column.
import pandas as pd
import numpy as np
names= ['John', 'Jane', 'Brian','Suzan', 'John']
expertise = ['primary_chat', 'follow_email', 'repeat_chat', 'primary_video_chat', 'tech_chat']
data = list(zip(names,expertise))
df = pd.DataFrame(data, columns=['Name', 'Communication'])
df
Output
Name Communication
0 John primary_chat
1 Jane follow_email
2 Brian repeat_chat
3 Suzan primary_video_chat
4 John tech_chat
When I add a new column by splitting the column:
df['Platform'] = df['Communication'].str.split('_', expand=True)[1]
df
Output
Name Communication Platform
0 John primary_chat chat
1 Jane follow_email email
2 Brian repeat_chat chat
3 Suzan primary_video_chat video
4 John tech_chat chat
But the problem is that, [1] takes the 2nd part of the split. This is not a problem when we only have one "_", the 2nd part is what we need. But when you have 2 "_"s like the 3rd one (Suzan), the [1] gets you the phrase "video" not the "email", we should have [2] indexing there.
We can dynamically get the number of "_"s and use this value but, the below code even though it outputs correct values, when I use it in [] as an index value I get an error.
df['Communication'].str.count('_')
0 1
1 1
2 1
3 2
4 1
Name: Communication, dtype: int64
gives me the correct number of "_". But when I try to use it in the previous line of code where I use split() and create the new column, I get an error
df['Platform'] = df['Communication'].str.split('_', expand=True)[df['Agent Expertise'].str.count('_')]
But I am getting error..
Maybe I should try to use apply() and lambda but I wonder if there is a way to fix this one..
CodePudding user response:
You can use a regular expression that looks for all characters but _
at the end of string (represented by $
):
df['Platform'] = df['Communication'].str.extract('([^_] )$')
CodePudding user response:
You can use str.rsplit
and limit the number of splits to 1:
df['Platform'] = df['Communication'].str.rsplit('_', n=1).str[1]
print(df)
# Output
Name Communication Platform
0 John primary_chat chat
1 Jane follow_email email
2 Brian repeat_chat chat
3 Suzan primary_video_chat chat
4 John tech_chat chat