I want to split the column by some specific words and keep the delimiter in the same time. I tried to split the column with str.split
but the result isn't I want.
example data(test.csv):
a
abc123and321abcor213cba
abc321or123cbaand321cba
my code:
import pandas as pd
df = pd.read_csv('test.csv')
df[['b','c']] = df['a'].str.split("and",1,expand=True)
df[['c','d']] = df['c'].str.split("or",1,expand=True)
print(df)
my result:
a b c d
0 abc123and321abcor213cba abc123 321abc 213cba
1 abc321or123cbaand321cba abc321or123cba 321cba None
Desired result:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
How can I do this?
CodePudding user response:
Borrowing from Tim's answer using a lookbehind Regex to split on and
or or
, without using up the seperating string in the split:
d = {'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]}
df = pandas.DataFrame(data=d)
df[["b", "c", "d"]] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)
Output:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
CodePudding user response:
If you have issues with split
check that you don't have a too old pandas version.
You could also use str.extractall
and unstack
. I'll also recommend to usejoin
to add the columns if you don't know in advance the number of matches/columns?
df = pd.DataFrame({'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]})
df.join(df['a'].str.extractall(r'(.*?(?:and|or)|. $)')[0].unstack('match'))
Output:
a 0 1 2
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
CodePudding user response:
Try splitting on the lookbehind (?<=and|or)
:
df[['b', 'c', 'd']] = df['a'].str.split(r'(?<=and)|(?<=or)', 1, expand=True)