So I have a pandas column that looks like this:
full_name = pd.Series([
'Reservoir 1 Compartment 1',
'Reservoir 1 Common Inlet',
'Reservoir 2 Compartment 1',
'Vyrnwy Line 2 Balancing Tank 1',
'Reservoir 1'
])
I am trying to split it into two columns. The expected output should look like this:
[['Reservoir 1', 'Compartment 1'],
['Reservoir 1', 'Common Inlet'],
['Reservoir 2', 'Compartment 1'],
['Vyrnwy Line 2', 'Balancing Tank 1'],
['Reservoir 1', None]]
I have tried this:
res_compartment_split = pd.concat([full_name.str.split(r'\s\s*?(?=[A-Z])', expand=True)])
but I get this output
[['Reservoir 1', 'Compartment 1', None, None],
['Reservoir 1', 'Common', 'Inlet', None],
['Reservoir 2', 'Compartment 1', None, None],
['Vyrnwy', 'Line 2', 'Balancing', 'Tank 1'],
['Reservoir 1', None, None, None]]
Thanks for any help.
CodePudding user response:
Try the following:
import pandas as pd
full_name = pd.Series([
'Reservoir 1 Compartment 1',
'Reservoir 1 Common Inlet',
'Reservoir 2 Compartment 1',
'Vyrnwy Line 2 Balancing Tank 1',
'Reservoir 1'
])
res = full_name.str.split('(?<=\d)\s (?=[A-Z])', expand=True)
Output:
>>> res
0 1
0 Reservoir 1 Compartment 1
1 Reservoir 1 Common Inlet
2 Reservoir 2 Compartment 1
3 Vyrnwy Line 2 Balancing Tank 1
4 Reservoir 1 None
Explanation of the regex pattern:
(?<=\d)
- positive lookbehind: ensures that there is a digit right before the separator, without consuming it\s
- separator: matches one or more whitespace(?=[A-Z])
- positive lookahead: ensures that there is a letter (A to Z) right after, without consuming it
See it in action using regex101.com.
Also, you can see here why your pattern doesn't work: https://regex101.com/r/nSmEEs/1 .
CodePudding user response:
You can extract strings that consist of any non-digits ending with one or more digits or end of string:
full_name.str.findall(r"\s*(\D (?:\d |$))")
Details:
\s*
- zero or more whitespaces(
- start of a capturing group:\D
- one or more non-digits(?:\d |$)
- one or more digits or end of string.
)
- end of the group.
See the regex demo.
CodePudding user response:
In your case the split point can be a space, preceded with a digit.
To match the preceding digit use positive lookbehind:
result = full_name.str.split(r'(?<=\d) ', expand=True)
Then you can set names to columns, e.g.:
result.columns = ['Res', 'Comp']
The result is:
Res Comp
0 Reservoir 1 Compartment 1
1 Reservoir 1 Common Inlet
2 Reservoir 2 Compartment 1
3 Vyrnwy Line 2 Balancing Tank 1
4 Reservoir 1 None