I've got the following string:
#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889
I want to extract the part between #n and the following # with a regex. The result should then be:
"John Doe"
This works with the following regex:
(?<=#n\s).(?:(?!#).)*
However, if the string looks as follows:
#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889
The regex returns:
"#a some University"
But I need it to return an empty string. Can someone help me with this problem?
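For reference, here is a minimal sketch that reproduces the behavior described above. It assumes Python's re module (the question does not name a regex engine) and assumes the lookbehind is meant to contain #n, as the described results imply; the helper name first_match is purely illustrative.

```python
import re

# Pattern from the question, assuming "#n" in the lookbehind.
PATTERN = re.compile(r"(?<=#n\s).(?:(?!#).)*")

def first_match(text: str) -> str:
    """Return the first match of PATTERN in text, or "" if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else ""

with_name = "#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889"
without_name = "#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889"

print(repr(first_match(with_name)))
print(repr(first_match(without_name)))
```

The unconditional . right after the lookbehind is what consumes the # of #a when the name is missing.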
CodePudding user response:
You can do that by matching one or more chars other than # after #n and a whitespace:
(?<=#n\s)[^#]+
See the regex demo. The (?<=#n\s) positive lookbehind matches a location immediately preceded by #n and a whitespace, and [^#]+ matches one or more chars other than #. If # follows directly, there is no match at all.
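A minimal sketch of the lookbehind approach, assuming Python's re module behaves like the target engine for this pattern (the lookbehind is fixed-width, which Python requires). The helper name extract_name and the choice to map "no match" to an empty string are assumptions made for illustration.

```python
import re

# One or more non-# chars directly after "#n" plus a single whitespace.
NAME_RE = re.compile(r"(?<=#n\s)[^#]+")

def extract_name(text: str) -> str:
    """Return the text after '#n ' up to the next '#', or "" if it is empty."""
    m = NAME_RE.search(text)
    # No match means '#n ' was followed directly by '#', so fall back to "".
    return m.group(0) if m else ""

with_name = "#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889"
without_name = "#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889"

print(repr(extract_name(with_name)))
print(repr(extract_name(without_name)))
```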
If #n can be followed by more than one whitespace, you can use a capturing group instead of the lookbehind. In PySpark, it will look like
from pyspark.sql.functions import col, regexp_extract
df.withColumn("result", regexp_extract(col("source"), r"#n\s+([^#]+)", 1))
See this regex demo. With #n\s+([^#]+), you match #n, one or more whitespaces, and then capture one or more non-# chars into Group 1. If the pattern does not match, regexp_extract returns an empty string.
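The capturing-group pattern can be checked without a Spark session. The sketch below uses Python's re module as a stand-in, on the assumption that it treats this pattern the same way as Spark's regex engine. The helper extract_group is hypothetical and only mimics regexp_extract returning an empty string when nothing matches.

```python
import re

# "#n", one or more whitespaces, then capture one or more non-# chars.
GROUP_RE = re.compile(r"#n\s+([^#]+)")

def extract_group(text: str) -> str:
    """Return Group 1 of GROUP_RE, or "" when the pattern does not match."""
    m = GROUP_RE.search(text)
    return m.group(1) if m else ""

with_name = "#index 1#n John Doe#a some University#pc 7#cn 4#hi 1#pi 0.5889"
extra_spaces = "#index 1#n   John Doe#a some University#pc 7"
without_name = "#index 1#n #a some University#pc 7#cn 4#hi 1#pi 0.5889"

for s in (with_name, extra_spaces, without_name):
    print(repr(extract_group(s)))
```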