Home > Software design >  Python, Pandas: What it is the best way to separate string into several columns?
Python, Pandas: What it is the best way to separate string into several columns?

Time:06-08

I want to split one string into several columns however I encounter issue with str.split method.

Example: While performing str.split on column Name which consist such strings it work as expected.

data1 = {'Name': ['Alice 23', 'Philip 12', 'Krish 64', 'John 29']}
df1 = pd.DataFrame(data1)
df1
    Name
0   Alice 23
1   Philip 12
2   Krish 64
3   John 29

performing split:

df1[['Name', 'age']] = df1['Name'].str.split(' ', 1, expand=True)
df1

Name    age
0   Alice   23
1   Philip  12
2   Krish   64
3   John    29

All good, as I wanted but if I need to place other separator like || its not working properly.

data2 = {'Name': ['Alice||23', 'Philip||12', 'Krish||64', 'John||29']}
df2 = pd.DataFrame(data)
df2

    Name
0   Alice||23
1   Philip||12
2   Krish||64
3   John||29

performing split...

df2[['Name', 'age']] = df2['Name'].str.split('[||]',1,expand = True)
df3[['Name', 'age']] = df2['Name'].str.split('||',1,expand = True)

results in not what I expected

df2
    Name    age
0   Alice   |23
1   Philip  |12
2   Krish   |64
3   John    |29

df3

Name    age
0       Alice||23
1       Philip||12
2       Krish||64
3       John||29

What's the reason for this behaviour and how get expected result which is as df1?

CodePudding user response:

The issue you're hitting is that pandas defaults to assuming your string to split on is a regular expression. In regular expressions the "|" character is a special character that enables you to match either expression on the left or right of that character (e.g. you can match 'a' or 'b' with the expression '(a|b)'.

In your case, we don't want to pass a regular expression, so you can pass .str.split(…, regex=False).

>>> df2['Name'].str.split('||', regex=False, expand=True)
        0   1
0   Alice  23
1  Philip  12
2   Krish  64
3    John  29

CodePudding user response:

escape the |:

df2['Name'].str.split('\|\|',expand = True)

Output:

        0   1
0   Alice  23
1  Philip  12
2   Krish  64
3    John  29
  • Related