Capture all the string before the 2nd and 3rd whitespace in Pandas

Time:02-19

I understand how to split a string at the first occurrence of whitespace. My question is how to split at the second or third occurrence of whitespace and capture the whole string before that point.

import pandas as pd

df = pd.DataFrame({"cid" : {0 : "cd1", 1 : "cd2", 2 : "cd3"},
                   "Name" : {0 : "John Maike Leiws", 1 : "Katie Sue Adam", 2 : "Tanaka Ubri Kse Suri"}}).set_index(['cid'])

                     Name
cid
cd1      John Maike Leiws
cd2        Katie Sue Adam
cd3  Tanaka Ubri Kse Suri

df['split_one'] = df.Name.str.split().str[0]

Expected output:

                     Name split_one    split_two       split_three
cid
cd1      John Maike Leiws      John   John Maike  John Maike Leiws
cd2        Katie Sue Adam     Katie    Katie Sue    Katie Sue Adam
cd3  Tanaka Ubri Kse Suri    Tanaka  Tanaka Ubri   Tanaka Ubri Kse

CodePudding user response:

Split the names once, slice the resulting lists with str indexing, then rejoin the words with Series.str.join:

s = df.Name.str.split()
df['split_one'] = s.str[0]
df['split_two'] = s.str[:2].str.join(' ')
df['split_three'] = s.str[:3].str.join(' ')
print(df)
                     Name split_one    split_two       split_three
cid                                                               
cd1      John Maike Leiws      John   John Maike  John Maike Leiws
cd2        Katie Sue Adam     Katie    Katie Sue    Katie Sue Adam
cd3  Tanaka Ubri Kse Suri    Tanaka  Tanaka Ubri   Tanaka Ubri Kse
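
If more prefix lengths are needed, the same split, slice, and join steps can be wrapped in a small helper. The sketch below is only an illustration: `first_n_words` is a hypothetical name, not part of pandas, and the column names are taken from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John Maike Leiws", "Katie Sue Adam", "Tanaka Ubri Kse Suri"]},
    index=pd.Index(["cd1", "cd2", "cd3"], name="cid"),
)


def first_n_words(names: pd.Series, n: int) -> pd.Series:
    """Hypothetical helper: keep the first n whitespace-separated words."""
    # split() with no arguments splits on any run of whitespace;
    # str[:n] slices each list, and str.join glues the words back together.
    return names.str.split().str[:n].str.join(" ")


# One column per prefix length: 1 word, 2 words, 3 words.
for n, col in enumerate(["split_one", "split_two", "split_three"], start=1):
    df[col] = first_n_words(df["Name"], n)

print(df)
```

Note that rejoining with a single space collapses any repeated whitespace in the original string.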

CodePudding user response:

An easy way with a regex is to use nested capturing groups:

df['Name'].str.extract(r'(((\S+)\s\S+)\s\S+)').iloc[:,::-1]

output:

                    0            1       2
cid                                       
cd1  John Maike Leiws   John Maike    John
cd2    Katie Sue Adam    Katie Sue   Katie
cd3   Tanaka Ubri Kse  Tanaka Ubri  Tanaka

To add these as new columns, assign the result with the column order reversed, so the innermost (shortest) group comes first:

df[['split_one', 'split_two', 'split_three']] = df['Name'].str.extract(r'(((\S+)\s\S+)\s\S+)').iloc[:,::-1]

output:

                     Name split_one    split_two       split_three
cid                                                               
cd1      John Maike Leiws      John   John Maike  John Maike Leiws
cd2        Katie Sue Adam     Katie    Katie Sue    Katie Sue Adam
cd3  Tanaka Ubri Kse Suri    Tanaka  Tanaka Ubri   Tanaka Ubri Kse
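
As a possible variation on the nested-group idea, named groups let str.extract label the columns directly, so the result can be selected by name instead of reversed by position. This is a sketch under the assumption that the group names match the column names wanted in the question; `pattern`, `parts`, and `result` are illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John Maike Leiws", "Katie Sue Adam", "Tanaka Ubri Kse Suri"]},
    index=pd.Index(["cd1", "cd2", "cd3"], name="cid"),
)

# Nested named groups: the outer group holds three words,
# the middle group two words, the innermost group one word.
pattern = r"(?P<split_three>(?P<split_two>(?P<split_one>\S+)\s\S+)\s\S+)"

# With named groups, str.extract uses the group names as column names.
parts = df["Name"].str.extract(pattern)

# Select the columns in the desired order and attach them by index.
result = df.join(parts[["split_one", "split_two", "split_three"]])
print(result)
```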

CodePudding user response:

I don't know if you are looking for something generic or something simple. This is one simple way to do it.

df = pd.DataFrame({"cid" : {0 : "cd1", 1 : "cd2", 2 : "cd3"},
                   "Name" : {0 : "John Maike Leiws", 1 : "Katie Sue Adam", 2 : "Tanaka Ubri Kse Suri"}}).set_index(['cid'])

s = df.Name.str.split().str
df['split_one'] = s[0]
df['split_two'] = s[0] + ' ' + s[1]
df['split_three'] = s[0] + ' ' + s[1] + ' ' + s[2]
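
One caveat with concatenating individual words: when a name has fewer words than requested, the missing element is NaN, and the whole concatenated value becomes NaN. The sketch below illustrates this with a made-up two-word name ("Ann Lee" is not in the original data) and compares it with slicing and rejoining, which keeps whatever words exist.

```python
import pandas as pd

# "Ann Lee" is a hypothetical two-word name added for illustration.
names = pd.Series(["John Maike Leiws", "Ann Lee"])
s = names.str.split().str

# s[2] is NaN for "Ann Lee", so the concatenation is NaN for that row.
concatenated = s[0] + " " + s[1] + " " + s[2]

# Slicing never goes out of range, so the two available words survive.
sliced = names.str.split().str[:3].str.join(" ")

print(concatenated)
print(sliced)
```

Which behaviour is preferable depends on whether a short name should count as missing data or be kept as-is.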