Regex ignores negative lookahead

Time:12-17

I've got the following string:

#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889

I want to extract the part between #n and the following # with a regex. The result should then be:

"John Doe"

This works with the following regex:

(?<=#n\s).(?:(?!#).)*

However, if the string looks as follows:

#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889

The regex returns:

"#a some University"

But I need it to return an empty string. Can someone help me with this problem?

CodePudding user response:

You may do that by extracting one or more chars other than # after #n and a whitespace:

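(?<=#n\s)[^#]+

The (?<=#n\s) positive lookbehind matches a location immediately preceded by #n and a whitespace character, and [^#]+ matches one or more characters other than #. Your pattern fails because the bare . right after the lookbehind consumes one character, even a #, before the negative lookahead is checked for the first time. With [^#]+, the second string gives no match at all, since the character right after "#n " is #.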
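To make the difference concrete, here is a minimal sketch using Python's built-in re module (the question does not say which engine is in use, so Python is an assumption here). The helper name extract is hypothetical; it returns an empty string when nothing matches, which is the result the question asks for.

```python
import re

first = "#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889"
second = "#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889"

# Pattern from the question: the bare "." consumes one character (even "#")
# before the negative lookahead is checked for the first time.
original = re.compile(r"(?<=#n\s).(?:(?!#).)*")

# Pattern from the answer: every matched character must be a non-"#".
fixed = re.compile(r"(?<=#n\s)[^#]+")


def extract(pattern, text):
    """Return the matched text, or an empty string when nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else ""


print(repr(extract(original, first)))   # 'John Doe'
print(repr(extract(original, second)))  # '#a some University'
print(repr(extract(fixed, first)))      # 'John Doe'
print(repr(extract(fixed, second)))     # ''
```

Note that the empty string in the last case comes from the helper's fallback, not from the regex itself: the pattern simply does not match.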

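If #n can be followed by one or more whitespace characters, you can use a capturing group instead of the lookbehind. In PySpark, it will look like

from pyspark.sql.functions import col, regexp_extract
df.withColumn("result", regexp_extract(col("source"), r"#n\s+([^#]+)", 1))

With #n\s+([^#]+), you match #n, one or more whitespace characters, and then capture one or more non-# characters into Group 1. When the pattern does not match, regexp_extract returns an empty string.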
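The capturing-group variant can be tried without a Spark session. The sketch below uses plain Python re and a hypothetical helper, extract_name, that imitates the empty-string-on-no-match behavior of regexp_extract. It is an approximation for experimenting with the pattern, not a replacement for the PySpark call.

```python
import re

# "#n", one or more whitespace characters, then Group 1 = one or more non-"#".
PATTERN = re.compile(r"#n\s+([^#]+)")


def extract_name(text):
    """Return Group 1, or an empty string when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else ""


with_name = "#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889"
extra_spaces = "#index 1#n    John Doe#a some University#pc 7"
without_name = "#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889"

print(repr(extract_name(with_name)))     # 'John Doe'
print(repr(extract_name(extra_spaces)))  # 'John Doe'
print(repr(extract_name(without_name)))  # ''
```

One caveat: if an empty name is followed by two or more whitespace characters before the next #, backtracking lets Group 1 capture the trailing whitespace, so trimming the result may be needed for such input.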
