Positive Lookbehind Stripping Out Metacharacters

Time:11-24

I need to extract the sequence at the end of many URLs to label CSV files. My current approach gives the result I want, but I am struggling to understand how I might use a positive lookbehind to capture all the characters after the word 'series' in the URL while ignoring any metacharacters. I know I can use re.sub() to delete them; however, I am interested in learning how to complete the whole process in one regex.

I have searched through many posts on how to do this and experimented with many different approaches, but I haven't been able to figure it out. Mostly I tried replacing the .+ after the (?<=series\-) with something that excludes the hyphen, but it hasn't worked.

url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'

res = re.search(r"(?<=series\-).+", url).group(0)

re.sub('-', '', res)

Which gives the desired result 'kbw10a'

Is it possible to strip out the metacharacter '-' in the positive lookbehind? Is there a better approach to this without the lookaround?

More examples:

 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',

CodePudding user response:

You cannot "ignore" chars in a lookaround the way you describe, because in order to match a part of a string, the regex engine needs to consume the part, from left to right, matching all subsequent subpatterns in your regex.
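A minimal sketch of why this is so, using the URL from the question (the variable names are illustrative): whatever a pattern matches is always one contiguous slice of the input, so a hyphen that lies inside the matched region is necessarily part of the result.

```python
import re

url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'

m = re.search(r"(?<=series-).+", url)

# group(0) is exactly the slice of the input between m.start() and m.end(),
# so every character in that range, hyphens included, belongs to the match.
matched = m.group(0)
is_contiguous = matched == url[m.start():m.end()]

print(matched)        # kbw-10a
print(is_contiguous)  # True
```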

The only way to achieve that is through an additional step: removing the hyphens once the match is found. Note that you do not need another regex to remove the hyphens; .replace('-', '') will suffice:

import re

url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
# Capture everything after "series-"
resObj = re.search(r"series-(.+)", url)
if resObj:
    # Drop the hyphens from the captured text
    res = resObj.group(1).replace('-', '')

Note that it is much safer to first run re.search to get the match object and only then access .group(): when there is no match, re.search returns None, and calling .group() on it raises an AttributeError.
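As a rough sketch, the check can be wrapped in a small helper and applied to the example URLs from the question. The function name series_suffix and the final URL without 'series-' are made up for illustration.

```python
import re

def series_suffix(url):
    """Return the text after 'series-' with hyphens removed, or None if absent."""
    match = re.search(r"series-(.+)", url)
    if match is None:
        # No 'series-' in the URL: report that instead of raising an exception
        return None
    return match.group(1).replace('-', '')

urls = [
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
]

labels = [series_suffix(u) for u in urls]
print(labels)  # ['kbw10', 'kbw10a', 'kh18a']

# Hypothetical URL with no 'series-' part, to show the no-match case
print(series_suffix('https://yanmarshop.com/en-GB/catalog/all'))  # None
```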

Also, there is no need for any lookarounds in the pattern; a capturing group works just as well.
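For comparison, a short sketch showing that both patterns extract the same text from the question's URL; the only difference is which group holds it. The variable names are illustrative.

```python
import re

url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'

with_lookbehind = re.search(r"(?<=series-).+", url)
with_group = re.search(r"series-(.+)", url)

# The lookbehind version returns the text as the whole match (group 0);
# the capturing-group version returns it as group 1.
print(with_lookbehind.group(0))  # kbw-10a
print(with_group.group(1))       # kbw-10a
```

In both cases the hyphen is still present, so the .replace('-', '') step is needed either way.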
