Home > Back-end >  regex split: ignore delimiter if followed by short substring
regex split: ignore delimiter if followed by short substring

Time:09-05

I have a csv file in which pipes serve as delimiters. But sometimes a short substring follows the 3rd pipe: up to 2 alphanumeric characters behind it. Then the 3rd pipe should not be interpreted as a delimiter.

example: split on each pipe:

x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"

=> split after XXL because it is followed by more than 2 characters

examples: split on all pipes except the 3rd if there are less than 3 characters between pipes 3 and 4:

x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"

x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"

=> keep "1g|z4" and "a1|2" together.

My regex attempts only suffice for a substring replacement like this one: It replaces the pipe with a hyphen if it finds it between 2 digits: 3|4 => 3-4.

x = re.sub(r'(?<=\d)\|(?=\d)', repl='-', string=x1, count=1).

My question is: If after the third pipe follows a short alphanumeric substring no longer than 1 or 2 characters (like Bx, 2, 42, z or 3b), then re.split should ignore the 3rd pipe and continue with the 4th pipe. All other pipes but #3 are unconditional delimiters.

CodePudding user response:

You can use re.sub to add quotechar around the short columns. Then use Python's builtin csv module to parse the text (regex101 of the used expression)

import re
import csv
from io import StringIO

txt = """\
as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY
as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf
as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"""


pat = re.compile(r"^((?:[^|] \|){2})([^|] \|[^|]{,2}(?=\|))", flags=re.M)
txt = pat.sub(r'\1"\2"', txt)

reader = csv.reader(StringIO(txt), delimiter="|", quotechar='"')
for line in reader:
    print(line)

Prints:

['as234-HJ123-HG', 'dfdf KHT werg', 'XXL', 's45dtgIKU', '2017-SS0', '123.45', 'asUJY']
['as234-H344423-dfX', 'dfer XXYUyu werg', '1g|z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
['as234-H3wer23-dZ', 'df3r Xa12yu wg', 'a1|2', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']

CodePudding user response:

I adapted Andrej's solution as follows: Assume that the dataframe has already been imported from csv without parsing.

To split the dataframe's single column 0, apply a function that checks if the 3rd pipe is a qualified delimiter.

pat1 is Andrej's solution for identifying if substring4 after the 3rd pipe is longer than 2 characters or not. If it is short, then substring3, pipe3 and substring4 are enclosed within double quotes in text x (in a dataframe, this result type differs from the list shown by the print loop). This part could be replaced by a different regex if your own criterion for "delimiter to ignore" differs from the example.

Next I replace the disqualified pipe(s), those between double quotes, with a hyphen: pat2 in re.sub. The function returns the resulting text y to the new dataframe column "out".

We can get rid of the double quotes in the entire column. They were only needed for the replacements.

Finally, we split column "out" into multiple columns by using all remaining pipe delimiters in str.split.

I suppose my 3 steps could be combined to fewer steps (first enclose 3rd pipe in double quotes if the pipe matches a pattern that disqualifies it as delimiter, then replace the disqualified pipe with a hyphen, then split the text/column). But I'm happy enough that this 3-step solution works.

# identify if 3rd pipe is a valid delimiter:
def cond_replace_3rd_pipe(row):
    # put substring3, 3rd pipe and short substring4 between double quotes
    pat1 = re.compile(r"^((?:[^|] \|){2})([^|] \|[^|]{,2}(?=\|))", flags=re.M)
    x = pat1.sub(r'\1"\2"', row[0])
    
    # replaces pipes between double quotes with hyphen
    pat2 = r'"(. )\|(. )"'
    y = re.sub(pat2, r'"\1-\2"', x)
    return y


df["out"] = df.apply(cond_replace_3rd_pipe, axis=1, result_type="expand")
df["out"] = df["out"].str.replace('"', "")    # double quotes no longer needed
df["out"].str.split('|', expand=True)   # split out into separate columns at all remaining pipes
  • Related