I have a string like this, where the delimiter is the | character:
string = "1234|Google | Alphabet|pest||pp| |||r"
The output I am looking for is:
[1234, Google | Alphabet, pest, "", pp, " ", "", "", r]
I tried this:
output = re.split(r"(?<=\w)\|(?=\w)", string) # but this gives me the wrong output
The issue here is that Google | Alphabet is a single field, since that | has a space on both sides. Basically, if a | has whitespace on both sides, it is part of the field itself; otherwise, split on it. Can someone suggest a good regex to split this properly? I want to use this regex in pandas.read_csv.
I could write code to handle this manually, but I am looking for a better approach that I can pass as sep in pd.read_csv (since it supports regex).
Thank you.
CodePudding user response:
You can use
\|(?<!\s\|(?=\s))
See the regex demo. Details:
\| - a | char
(?<!\s\|(?=\s)) - a negative lookbehind that fails the match if that | is both immediately preceded and immediately followed by a whitespace.
See the Python demo:
import re
s = "1234|Google | Alphabet|pest||pp| |||r"
print( re.split(r'\|(?<!\s\|(?=\s))', s) )
# => ['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']
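For comparison, here is a small sketch of why the pattern from the question falls short: with \w required on both sides, a split only happens where the | sits between two word characters, so empty fields and fields next to a space are never separated. The variable names are illustrative only.

```python
import re

s = "1234|Google | Alphabet|pest||pp| |||r"

# Pattern from the question: split only where | sits between two word characters.
question_pattern = r"(?<=\w)\|(?=\w)"
# Pattern from this answer: split on every | except one with whitespace on both sides.
answer_pattern = r"\|(?<!\s\|(?=\s))"

question_result = re.split(question_pattern, s)
answer_result = re.split(answer_pattern, s)

print(question_result)
# => ['1234', 'Google | Alphabet', 'pest||pp| |||r']
print(answer_result)
# => ['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']
```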
CodePudding user response:
Another solution:
import re
s = "1234|Google | Alphabet|pest||pp| |||r"
sep = r"(?:(?<=\S)\|(?=\S))|(?:(?<=\s)\|(?=\S))|(?:(?<=\S)\|(?=\s))"
print(re.split(sep, s))
Prints:
['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']
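One caveat worth checking, sketched below under the assumption that the first or last field of a line can be empty: every alternative in this pattern needs a positive lookbehind and a positive lookahead to succeed, so a | with no character on one side matches none of them. The sample strings are made up for illustration.

```python
import re

sep = r"(?:(?<=\S)\|(?=\S))|(?:(?<=\s)\|(?=\S))|(?:(?<=\S)\|(?=\s))"

# Hypothetical inputs with an empty first or last field.
leading = re.split(sep, "|a|b")
trailing = re.split(sep, "a|b|")

print(leading)   # => ['|a', 'b']
print(trailing)  # => ['a', 'b|']
```

If such rows can occur in your data, a pattern built from negative lookarounds, which also succeed at the edges of the string, may be the safer choice.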
CodePudding user response:
You could also split asserting not a whitespace char to the left or to the right:
\|(?!\s)|(?<!\s)\|
import re
s = "1234|Google | Alphabet|pest||pp| |||r"
print(re.split(r"\|(?!\s)|(?<!\s)\|", s))
Output:
['1234', 'Google | Alphabet', 'pest', '', 'pp', ' ', '', '', 'r']
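To apply this to more than one row, here is a minimal sketch that splits an in-memory file line by line with the compiled pattern, which is roughly what a regex separator does for each line. The sample rows and the parse_rows helper are hypothetical. For pd.read_csv itself, keep in mind that a regex sep forces the slower Python parsing engine and that regex delimiters are prone to ignoring quoted data.

```python
import io
import re

# Pattern from this answer: split on | unless it has whitespace on both sides.
SEP = re.compile(r"\|(?!\s)|(?<!\s)\|")

def parse_rows(lines):
    """Split each non-empty line into fields using SEP (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")  # drop only the line ending, keep meaningful spaces
        if line:
            rows.append(SEP.split(line))
    return rows

# Made-up sample data standing in for a real file.
data = io.StringIO(
    "1234|Google | Alphabet|pest||pp| |||r\n"
    "5678|Meta | Facebook|x||yy| |||z\n"
)
rows = parse_rows(data)
for row in rows:
    print(row)
```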