Home > Back-end >  Split Columns in Pandas
Split Columns in Pandas

Time:06-19

So I have a pandas column that looks like this:

full_name = pd.Series([
    'Reservoir 1 Compartment 1', 
    'Reservoir 1 Common Inlet', 
    'Reservoir 2 Compartment 1',
    'Vyrnwy Line 2 Balancing Tank 1', 
    'Reservoir 1'
])

I am trying to split it into two columns. The expected output should look like this:

[['Reservoir 1', 'Compartment 1'],
 ['Reservoir 1', 'Common Inlet'],
 ['Reservoir 2', 'Compartment 1'],
 ['Vyrnwy Line 2', 'Balancing Tank 1'],
 ['Reservoir 1', None]]

I have tried this:

res_compartment_split = pd.concat([full_name.str.split(r'\s\s*?(?=[A-Z])', expand=True)])

but I get this output

[['Reservoir 1', 'Compartment 1', None, None],
 ['Reservoir 1', 'Common', 'Inlet', None],
 ['Reservoir 2', 'Compartment 1', None, None],
 ['Vyrnwy', 'Line 2', 'Balancing', 'Tank 1'],
 ['Reservoir 1', None, None, None]]

Thanks for any help.

CodePudding user response:

Try the following:

import pandas as pd

full_name = pd.Series([
    'Reservoir 1 Compartment 1', 
    'Reservoir 1 Common Inlet', 
    'Reservoir 2 Compartment 1',
    'Vyrnwy Line 2 Balancing Tank 1', 
    'Reservoir 1'
])

res = full_name.str.split('(?<=\d)\s (?=[A-Z])', expand=True)

Output:

>>> res
               0                 1
0    Reservoir 1     Compartment 1
1    Reservoir 1      Common Inlet
2    Reservoir 2     Compartment 1
3  Vyrnwy Line 2  Balancing Tank 1
4    Reservoir 1              None

Explanation of the regex pattern:

  • (?<=\d) - positive lookbehind: ensures that there is a digit right before the separator, without consuming it
  • \s - separator: matches one or more whitespace
  • (?=[A-Z]) - positive lookahead: ensures that there is a letter (A to Z) right after, without consuming it

See it in action using regex101.com.

Also, you can see here why your pattern doesn't work: https://regex101.com/r/nSmEEs/1 .

CodePudding user response:

You can extract strings that consist of any non-digits ending with one or more digits or end of string:

full_name.str.findall(r"\s*(\D (?:\d |$))")

Details:

  • \s* - zero or more whitespaces
  • ( - start of a capturing group:
    • \D - one or more non-digits
    • (?:\d |$) - one or more digits or end of string.
  • ) - end of the group.

See the regex demo.

CodePudding user response:

In your case the split point can be a space, preceded with a digit.

To match the preceding digit use positive lookbehind:

result = full_name.str.split(r'(?<=\d) ', expand=True)

Then you can set names to columns, e.g.:

result.columns = ['Res', 'Comp']

The result is:

             Res              Comp
0    Reservoir 1     Compartment 1
1    Reservoir 1      Common Inlet
2    Reservoir 2     Compartment 1
3  Vyrnwy Line 2  Balancing Tank 1
4    Reservoir 1              None
  • Related