regex match two words based on a matching substring

Time:07-20

There are 4 strings, as shown below:

ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv 

Two strings are considered a match when they share the same substring value (VALUEABC or VALUEDEF in the strings above). So I want to pair the first 2 (both containing VALUEABC) and then the next 2 (both containing VALUEDEF). Matching strings are identified by the same value being returned for one regex group.

What I tried so far

ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv

This returns "VALUEABC" in group 1 (from (.*[^QUERY_answer])) for the first 2 strings and "VALUEDEF" for the next 2 strings, so the desired matching is achieved.

But the problem with the above regex is that as soon as the value ends with any of the characters in "QUERY_answer", the regex no longer matches at all. For instance, the 2 strings below don't match, because VALUESTU ends with "U":

ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
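
The failure can be reproduced with a short Python sketch. It assumes Python's re module behaves like the regex engine used in the question, and the helper name group1 is made up for illustration. The point it shows is that [^QUERY_answer] is a character class (one char that is none of Q, U, E, R, Y, _, a, n, s, w, e, r), not a negation of the whole word.

```python
import re

# Pattern from the question, copied unchanged.
pattern = re.compile(r"ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv")

names = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

def group1(name):
    """Return group 1 for a file name, or None when the pattern fails."""
    m = pattern.search(name)
    return m.group(1) if m else None

# Map each file name to its captured value (hypothetical helper output).
results = {name: group1(name) for name in names}

for name, value in results.items():
    print(name, "->", value)
```

The two VALUEABC names share a group-1 value because "C" is not in the class, while both VALUESTU names fail because "U" is.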

I tried to use Negative Lookahead:

ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv

but in this case group 1 is returned as "VALUESTU" for the first string and "VALUESTUQUERY_answer" for the second string, so the 2 strings end up unmatched.
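
A small Python sketch of the lookahead attempt, again assuming Python's re module stands in for the engine used in the question. The greedy .* first runs to the end of the name and backtracks only as far as needed, so the lookahead is tested at a position where QUERY_answer has already been consumed and therefore never blocks anything.

```python
import re

# Lookahead attempt from the question, copied unchanged.
pattern = re.compile(r"ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv")

plain = pattern.search("ABC_FIXED_20220720_VALUESTU.csv")
suffixed = pattern.search("ABC_FIXED_20220720_VALUESTUQUERY_answer.csv")

# Both names match, but the captured values differ,
# so group 1 cannot be used to pair the two files.
print(plain.group(1))
print(suffixed.group(1))
```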

Any way to achieve the desired matching?

CodePudding user response:

You need

ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv

Note

  • .*[^QUERY_answer] matches any zero or more chars other than line break chars, as many as possible, and then any one char other than Q, U, E, etc., i.e. any char not listed in the negated character class. It is replaced with .*?, which matches any zero or more chars other than line break chars, as few as possible.
  • (?:QUERY_answer)? - the optional group is kept non-capturing, so group 1 stays the only capturing group.
  • \.csv - the . is escaped to match a literal dot.
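
As a rough illustration of how the lazy pattern could be used to pair files, here is a Python sketch. The function name group_by_key and the sample list are assumptions made for this example; only the regex comes from the answer.

```python
import re
from collections import defaultdict

# Regex from the answer: lazy group 1, optional suffix, escaped dot.
pattern = re.compile(r"ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv")

def group_by_key(names):
    """Bucket file names by their group-1 value (hypothetical helper)."""
    buckets = defaultdict(list)
    for name in names:
        m = pattern.search(name)
        if m:
            buckets[m.group(1)].append(name)
    return dict(buckets)

names = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUEDEF.csv",
    "ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

groups = group_by_key(names)
for key, members in groups.items():
    print(key, members)
```

Because .*? grows one char at a time, the optional QUERY_answer group gets the chance to consume the suffix before group 1 does, which is why VALUESTU now pairs correctly.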

CodePudding user response:

With your shown samples, please try the following regex.

^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$

Or, to match exactly 8 digits, try:

^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$

Explanation: a detailed breakdown of the first regex above.

^ABC_[^_]*_        ##Matches ABC_ at the start of the value, then everything up to the next _.
[0-9]+_            ##Matches one or more consecutive digits followed by _.
(.*?)              ##The one and only capturing group, using a lazy match (the opposite of a greedy match).
(?:QUERY_answer)?  ##A non-capturing group that optionally matches QUERY_answer.
\.csv$             ##Matches a literal dot followed by csv at the end of the value.
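
The effect of the ^ and $ anchors can be checked with a short Python sketch using the 8-digit variant. The helper name key and the rejected sample names (a .bak suffix, an extra leading char, a 7-digit date) are invented for illustration.

```python
import re

# Anchored 8-digit variant from this answer.
anchored = re.compile(r"^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$")

def key(name):
    """Return the group-1 value, or None if the whole name does not fit."""
    m = anchored.search(name)
    return m.group(1) if m else None

print(key("ABC_FIXED_20220720_VALUESTU.csv"))
print(key("ABC_FIXED_20220720_VALUESTUQUERY_answer.csv"))

# Names that deviate from the expected shape are rejected outright.
print(key("ABC_FIXED_20220720_VALUESTU.csv.bak"))  # text after .csv
print(key("XABC_FIXED_20220720_VALUESTU.csv"))     # text before ABC
print(key("ABC_FIXED_2022072_VALUESTU.csv"))       # only 7 digits
```

Anchoring is a design choice: it trades tolerance for stricter validation of the whole file name, which may or may not be wanted depending on where the names come from.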