Home > Net >  Dynamically Splitting Column of a DataFrame and Store it as a New Column
Dynamically Splitting Column of a DataFrame and Store it as a New Column

Time:05-10

I am trying to split a column and store the part after the last "_" as a new column.

import pandas as pd
import numpy as np
names= ['John', 'Jane', 'Brian','Suzan', 'John']
expertise = ['primary_chat', 'follow_email', 'repeat_chat', 'primary_video_chat', 'tech_chat']

data  = list(zip(names,expertise))
df = pd.DataFrame(data, columns=['Name', 'Communication'])
df

Output

    Name       Communication
0   John        primary_chat
1   Jane        follow_email
2  Brian         repeat_chat
3  Suzan  primary_video_chat
4   John           tech_chat

When I add a new column by splitting the column:

df['Platform'] = df['Communication'].str.split('_', expand=True)[1]
df

Output

    Name       Communication Platform
0   John        primary_chat     chat
1   Jane        follow_email    email
2  Brian         repeat_chat     chat
3  Suzan  primary_video_chat    video
4   John           tech_chat     chat

But the problem is that, [1] takes the 2nd part of the split. This is not a problem when we only have one "_", the 2nd part is what we need. But when you have 2 "_"s like the 3rd one (Suzan), the [1] gets you the phrase "video" not the "email", we should have [2] indexing there.

We can dynamically get the number of "_"s and use this value but, the below code even though it outputs correct values, when I use it in [] as an index value I get an error.

df['Communication'].str.count('_')
0    1
1    1
2    1
3    2
4    1
Name: Communication, dtype: int64

gives me the correct number of "_". But when I try to use it in the previous line of code where I use split() and create the new column, I get an error

df['Platform'] = df['Communication'].str.split('_', expand=True)[df['Agent Expertise'].str.count('_')]

But I am getting error..

Maybe I should try to use apply() and lambda but I wonder if there is a way to fix this one..

CodePudding user response:

You can use a regular expression that looks for all characters but _ at the end of string (represented by $):

df['Platform'] = df['Communication'].str.extract('([^_] )$')

CodePudding user response:

You can use str.rsplit and limit the number of splits to 1:

df['Platform'] = df['Communication'].str.rsplit('_', n=1).str[1]
print(df)

# Output
    Name       Communication Platform
0   John        primary_chat     chat
1   Jane        follow_email    email
2  Brian         repeat_chat     chat
3  Suzan  primary_video_chat     chat
4   John           tech_chat     chat
  • Related