I understand how to split a string at the first occurrence of whitespace. My question is how to split at the second or third occurrence of whitespace and capture everything in the string before that point.
import pandas as pd

df = pd.DataFrame({"cid": {0: "cd1", 1: "cd2", 2: "cd3"},
                   "Name": {0: "John Maike Leiws", 1: "Katie Sue Adam", 2: "Tanaka Ubri Kse Suri"}}).set_index(['cid'])
     Name
cid
cd1  John Maike Leiws
cd2  Katie Sue Adam
cd3  Tanaka Ubri Kse Suri
df['split_one'] = df.Name.str.split().str[0]
Expected output:
     Name                  split_one  split_two    split_three
cid
cd1  John Maike Leiws      John       John Maike   John Maike Leiws
cd2  Katie Sue Adam        Katie      Katie Sue    Katie Sue Adam
cd3  Tanaka Ubri Kse Suri  Tanaka     Tanaka Ubri  Tanaka Ubri Kse
CodePudding user response:
Use indexing with str and then Series.str.join:
s = df.Name.str.split()
df['split_one'] = s.str[0]
df['split_two'] = s.str[:2].str.join(' ')
df['split_three'] = s.str[:3].str.join(' ')
print(df)
     Name                  split_one  split_two    split_three
cid
cd1  John Maike Leiws      John       John Maike   John Maike Leiws
cd2  Katie Sue Adam        Katie      Katie Sue    Katie Sue Adam
cd3  Tanaka Ubri Kse Suri  Tanaka     Tanaka Ubri  Tanaka Ubri Kse
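If more prefixes are needed, the same split/slice/join idea can be wrapped in a small helper. This is only a sketch: the function name first_n_words and the labels mapping are made up for illustration and are not part of pandas.

```python
import pandas as pd


def first_n_words(names: pd.Series, n: int) -> pd.Series:
    """Return the first n whitespace-separated words of each string,
    re-joined with single spaces (hypothetical helper, not a pandas API)."""
    return names.str.split().str[:n].str.join(" ")


df = pd.DataFrame(
    {"Name": ["John Maike Leiws", "Katie Sue Adam", "Tanaka Ubri Kse Suri"]},
    index=pd.Index(["cd1", "cd2", "cd3"], name="cid"),
)

# Map each prefix length to the column name used in the question.
labels = {1: "split_one", 2: "split_two", 3: "split_three"}
for n, column in labels.items():
    df[column] = first_n_words(df["Name"], n)

print(df)
```

Because list slicing never raises on short lists, a name with fewer than n words simply comes back whole.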
CodePudding user response:
An easy way with a regex is to use nested capturing groups:
df['Name'].str.extract(r'(((\S+)\s\S+)\s\S+)')
output:
     0                 1            2
cid
cd1  John Maike Leiws  John Maike   John
cd2  Katie Sue Adam    Katie Sue    Katie
cd3  Tanaka Ubri Kse   Tanaka Ubri  Tanaka
To add these as new columns, reverse the column order so that the shortest match comes first:
df[['split_one', 'split_two', 'split_three']] = df['Name'].str.extract(r'(((\S+)\s\S+)\s\S+)').iloc[:,::-1]
output:
     Name                  split_one  split_two    split_three
cid
cd1  John Maike Leiws      John       John Maike   John Maike Leiws
cd2  Katie Sue Adam        Katie      Katie Sue    Katie Sue Adam
cd3  Tanaka Ubri Kse Suri  Tanaka     Tanaka Ubri  Tanaka Ubri Kse
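One thing to keep in mind with the regex approach: the pattern requires three words, so a shorter name does not match at all and every extracted column is NaN for that row. The sketch below compares it with the split-and-join approach on a made-up two-word name ("Katie Sue" under the hypothetical id cd4), which is not in the original data.

```python
import pandas as pd

# "cd4" is an invented row with only two words, added to show the difference.
names = pd.Series(["John Maike Leiws", "Katie Sue"], index=["cd1", "cd4"])

# Regex approach: groups are numbered by their opening parenthesis, so
# column 0 is the three-word match, column 1 the two-word match, and
# column 2 the first word. A row with fewer than three words yields NaN.
extracted = names.str.extract(r"(((\S+)\s\S+)\s\S+)")

# Split approach: slicing past the end of a list is harmless, so the
# two-word name is returned unchanged.
joined = names.str.split().str[:3].str.join(" ")

print(extracted)
print(joined)
```

If short names can occur, either handle the NaN rows afterwards or prefer the split-based version.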
CodePudding user response:
I don't know if you are looking for something generic or something simple. This is one simple way to do it.
df = pd.DataFrame({"cid" : {0 : "cd1", 1 : "cd2", 2 : "cd3"},
"Name" : {0 : "John Maike Leiws", 1 : "Katie Sue Adam", 2 : "Tanaka Ubri Kse Suri"}}).set_index(['cid'])
s = df.Name.str.split().str
df['split_one'] = s[0]
df['split_two'] = s[0] + ' ' + s[1]
df['split_three'] = s[0] + ' ' + s[1] + ' ' + s[2]