RegEx to match only a specific column with lookaround

Time:10-17

I have a .CSV which I'm handling in a large file editor (BssEditor):

DOC;NAME;A_TYPE;ADDRESS;NUMBER;COMPLEMENT;NEIGHBORHOOD;CITY;STATE;ZIPCODE
7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436
7971541;Anakim Skywalker;AV;VISCONDE OF JEQUITINHONHA;0000243;AP 601;GOOD VOYAGE;RECIFE;PE;51021190
7971974;Jabba the Hutt;;DOS ILHEUS;0000118;APT 600;CENTER;FLOWERPOLIS;SC;88010560
7972512;Mando;;JUNDIACANGA;0000037;HOUSE;IPAVA CITY;SAINT PAUL;SP;04950150

The column delimiter is ;, and I want to match all the leading zeros in the NUMBER column and replace them with nothing.
Ex.: 0000731 → 731

It's easy to match everything with ^((.*?;){4})0+ and replace it with $1, but not with lookaround...
I tried regexes like these:

/^(?<=.*?;){4}0+/
/(?<=^.*?;.*?;.*?;.*?;)0+/

but it looks like the greedy wildcard only works within a lookahead, not a lookbehind.

Is there a way?
And if there is, is there a performance issue when dealing with millions of entries?

CodePudding user response:

An infinite quantifier in a lookbehind is supported by only a few regex engines (.NET, the Python PyPI regex module, newer JavaScript engines such as V8), but not by Notepad++, which uses Boost.
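As a small illustration of that limitation, here is a sketch using Python's built-in re module (not the PyPI regex module mentioned above). The built-in module rejects a lookbehind of variable width at compile time, while a fixed-width one is accepted. The helper name `compiles` is made up for this example.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for unsupported constructs such as a variable-width lookbehind
        return False


# The pattern from the question: the lookbehind contains unbounded quantifiers
variable_width = r"(?<=^.*?;.*?;.*?;.*?;)0+"

# A lookbehind that always matches exactly one character
fixed_width = r"(?<=;)0+"

print(compiles(variable_width))
print(compiles(fixed_width))
```

Other engines behave differently, so this only shows what one fixed-width-lookbehind engine does with the pattern from the question.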

If you are using Notepad++, you don't need lookarounds or capture groups. You can repeat the semicolon-separated parts until you reach the NUMBER column and then use \K to clear the current match buffer.

In the replacement use an empty string.

^(?:[^;\n]*;){4}\K0+
  • ^ Start of string
  • (?:[^;\n]*;){4} Repeat 4 times: match any chars except ; or a newline, then match ;
  • \K Forget what is matched so far
  • 0+ Match a zero one or more times


The capture group solution also seems like a good one. You could write it with a single capture group and use a negated character class instead of .*? to prevent some backtracking.

^((?:[^;\n]*;){4})0+

In the replacement use group 1, often notated as $1 or \1
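For anyone scripting this instead of using an editor, the capture group version can be sketched in Python, whose built-in re module has no \K. The function name `strip_number_zeros` and the `re.MULTILINE` flag are choices made for this example, and the sample is a row from the question.

```python
import re

# Group 1 keeps the first four fields; the zeros after it are dropped.
# re.MULTILINE lets ^ match at the start of every line in a multi-line string.
LEADING_ZEROS = re.compile(r"^((?:[^;\n]*;){4})0+", re.MULTILINE)


def strip_number_zeros(text: str) -> str:
    """Remove leading zeros from the fifth semicolon-separated column."""
    return LEADING_ZEROS.sub(r"\1", text)


sample = "7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436"
print(strip_number_zeros(sample))
```

Note that a NUMBER made up only of zeros would end up empty with this pattern, which may or may not be what you want.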


CodePudding user response:

I don't know about BssEditor, but the following works in Notepad++

(?<=;)0+(?=\d+;(?:[^;]*;){4}[^;]*?$)

A positive lookahead is used to only match if there are exactly five semicolons ahead in the string on that line.
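The same pattern can be tried outside the editor. This is a sketch with Python's built-in re module, applied to one line at a time because [^;] would also match a newline in a multi-line string. The function name `strip_zeros_lookahead` is invented for the example.

```python
import re

# Fixed-width lookbehind for the preceding ;, then a lookahead that requires
# the remaining digits of the number and five more semicolons up to line end.
LOOKAROUND = re.compile(r"(?<=;)0+(?=\d+;(?:[^;]*;){4}[^;]*?$)")


def strip_zeros_lookahead(line: str) -> str:
    """Remove the leading zeros of the NUMBER column from a single CSV line."""
    return LOOKAROUND.sub("", line)


row = "7972512;Mando;;JUNDIACANGA;0000037;HOUSE;IPAVA CITY;SAINT PAUL;SP;04950150"
print(strip_zeros_lookahead(row))
```

The leading zero of the ZIPCODE in this row is left alone, because no semicolon follows its digits and the lookahead fails there.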

is there a performance issue when dealing with millions of entries?

Possibly.
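If performance in an editor turns out to be a problem, one option is to stream the file through a script instead. The following is only a sketch under assumptions: it uses the anchored capture group pattern from the other answer rather than the lookaround one, processes one line at a time so memory use stays flat, and the name `strip_zeros_stream` is made up here. No timings are claimed; you would need to measure on your own data.

```python
import re
from typing import Iterable, Iterator

# Anchored at the start of each line, so the engine only has to walk the
# first four fields once per line before it matches or gives up.
ANCHORED = re.compile(r"^((?:[^;\n]*;){4})0+")


def strip_zeros_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with the leading zeros of the fifth column removed."""
    for line in lines:
        yield ANCHORED.sub(r"\1", line, count=1)


# Works with any iterable of lines, including an open file object
for out in strip_zeros_stream(["a;b;c;d;0007;x\n", "DOC;NAME;A_TYPE;ADDRESS;NUMBER\n"]):
    print(out, end="")
```

Because a file object is itself an iterable of lines, the generator could be fed an open input file and its results written straight to an output file.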
