Home > Back-end >  pandas split column by some specific words and keep this delimiter
pandas split column by some specific words and keep this delimiter

Time:12-31

I want to split the column by some specific words and keep the delimiter in the same time. I tried to split the column with str.split but the result isn't I want.

example data(test.csv):

a
abc123and321abcor213cba
abc321or123cbaand321cba

my code:

import pandas as pd
df = pd.read_csv('test.csv')
df[['b','c']] = df['a'].str.split("and",1,expand=True)
df[['c','d']] = df['c'].str.split("or",1,expand=True)
print(df)

my result:

                         a               b       c       d
0  abc123and321abcor213cba          abc123  321abc  213cba
1  abc321or123cbaand321cba  abc321or123cba  321cba    None

Desired result:

                         a               b         c       d
0  abc123and321abcor213cba       abc123and  321abcor  213cba
1  abc321or123cbaand321cba        abc321or 123cbaand  321cba

How can I do this?

CodePudding user response:

Borrowing from Tim's answer using a lookbehind Regex to split on and or or, without using up the seperating string in the split:

d = {'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]}

df = pandas.DataFrame(data=d)

df[["b", "c", "d"]] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)

Output:

                         a          b          c       d
0  abc123and321abcor213cba  abc123and   321abcor  213cba
1  abc321or123cbaand321cba   abc321or  123cbaand  321cba

CodePudding user response:

If you have issues with split check that you don't have a too old pandas version.

You could also use str.extractall and unstack. I'll also recommend to usejoin to add the columns if you don't know in advance the number of matches/columns?

df = pd.DataFrame({'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]})

df.join(df['a'].str.extractall(r'(.*?(?:and|or)|. $)')[0].unstack('match'))

Output:

                         a          0          1       2
0  abc123and321abcor213cba  abc123and   321abcor  213cba
1  abc321or123cbaand321cba   abc321or  123cbaand  321cba

CodePudding user response:

Try splitting on the lookbehind (?<=and|or):

df[['b', 'c', 'd']] = df['a'].str.split(r'(?<=and)|(?<=or)', 1, expand=True)
  • Related