How to use a regex delimiter to handle a special case?

I have a string like this, where the delimiter is the | character:

string = "1234|Google | Alphabet|pest||pp| |||r"

The output I am looking for is:

[1234, Google | Alphabet, pest, "", pp, " ", "", "", r]

I used this:

output = re.split(r"(?<=\w)\|(?=\w)", string)  # but this gives the wrong output

The issue is that "Google | Alphabet" is a single field, because its | has a space on both sides. In other words, a | with a space on both sides is part of the field itself; any other | is a delimiter. Can someone suggest a regex that splits this properly? I want to use it in pandas.read_csv.

I could write code to handle this manually, but I am looking for something I can pass as sep to pd.read_csv, since it supports regex.

Thank you.

CodePudding user response:

You can use

\|(?<!\s\|(?=\s))

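Details:

\| - a literal | char
(?<!\s\|(?=\s)) - a negative lookbehind, checked right after the |, that fails the match if the | is immediately preceded by a whitespace char and immediately followed by one. A | with whitespace on only one side, or on neither side, still matches.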

In Python:

import re
s = "1234|Google | Alphabet|pest||pp| |||r"
print( re.split(r'\|(?<!\s\|(?=\s))', s) )
# => ['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']
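
Since the question asks about pandas.read_csv, here is a minimal sketch of passing the same pattern as sep. The in-memory sample line and the options header=None, dtype=str and keep_default_na=False are assumptions made for illustration; they are not part of the original question.

```python
import io

import pandas as pd

# One-line sample standing in for a real file (hypothetical data).
data = "1234|Google | Alphabet|pest||pp| |||r\n"

df = pd.read_csv(
    io.StringIO(data),
    sep=r"\|(?<!\s\|(?=\s))",  # regex separator from the answer above
    engine="python",           # regex separators need the Python engine
    header=None,               # assumption: the data has no header row
    dtype=str,                 # assumption: keep every field as text
    keep_default_na=False,     # assumption: keep empty fields as "" rather than NaN
)

row = df.iloc[0].tolist()
print(row)
```

Note that the pandas documentation warns that regex separators are prone to ignoring quoted data, so this approach is best suited to files without quoted fields.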

CodePudding user response:

Another solution:

import re

s = "1234|Google | Alphabet|pest||pp| |||r"

sep = r"(?:(?<=\S)\|(?=\S))|(?:(?<=\s)\|(?=\S))|(?:(?<=\S)\|(?=\s))"

print(re.split(sep, s))

Prints:

['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']

CodePudding user response:

You could also split on a | that is not followed by a whitespace char, or on a | that is not preceded by one:

\|(?!\s)|(?<!\s)\|

import re

s = "1234|Google | Alphabet|pest||pp| |||r"

print(re.split(r"\|(?!\s)|(?<!\s)\|", s))

Output

['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']
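
All three patterns posted in this thread agree on the sample string, but they may behave differently when a line starts or ends with |. The sketch below compares them on a hypothetical edge-case line (not from the question): the pattern built only from positive lookarounds needs a character on both sides of the |, so it should leave a leading or trailing | unsplit, while the other two should produce an empty first and last field.

```python
import re

# The three separators posted in this thread, in answer order.
patterns = {
    "nested_lookahead": r"\|(?<!\s\|(?=\s))",
    "three_alternatives": r"(?:(?<=\S)\|(?=\S))|(?:(?<=\s)\|(?=\S))|(?:(?<=\S)\|(?=\s))",
    "either_side": r"\|(?!\s)|(?<!\s)\|",
}

sample = "1234|Google | Alphabet|pest||pp| |||r"  # string from the question
edge = "|a | b|c|"  # hypothetical line with a leading and a trailing pipe

# Split both strings with every pattern and collect the results by name.
results = {
    name: {"sample": re.split(pat, sample), "edge": re.split(pat, edge)}
    for name, pat in patterns.items()
}

for name, res in results.items():
    print(name, res["edge"])
```

Which behaviour is right depends on whether a leading or trailing | in the data is meant to mark an empty field.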