Regex to extract 2 lists of connected words

Time:11-15

I want to extract 2 lists of words that are connected by the sign =. The regex code works for separate lists but not in combination.

Example string: bla word1="word2" blabla abc="xyz" bla bla

One output shall contain the words directly left of =, i.e. word1, abc and the other output shall contain the words directly right of =, i.e. word2, xyz without quotes.

\w+(?==\"(?:(?!\").)*\") extracts the words left of =, i.e. word1, abc

=\"(?:(?!\").)*\" extracts the words right of = including quotes and =, i.e. ="word2",="xyz"

How can I combine these 2 queries into a single regular expression that outputs 2 groups? Quotes and equal signs shall not be part of the output.

CodePudding user response:

You can use

([^\s=]+)="([^"]*)"

See the regex demo. Details:

  • ([^\s=]+) - Group 1: one or more occurrences of a char other than whitespace and the = char
  • =" - a =" substring
  • ([^"]*) - Group 2: zero or more chars other than the " char
  • " - a " char.

Note: \w+ matches only one or more letters, digits and underscores, so it won't match keys that contain, say, hyphens. The tempered greedy token (?:(?!\").)* is inefficient and does not match line break chars. Since the negative lookahead contains only a single-char pattern (\"), it is more efficient to write it as a negated character class, [^"]*. Unlike the tempered greedy token, this class also matches line break chars. If you do not want that behavior, add \r\n to the negated character class: [^"\r\n]*.
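
As a minimal sketch of how this pattern yields the two lists, assuming Python (the question does not name a language, and the variable names here are illustrative): re.findall returns one (key, value) tuple per match when the pattern has two groups, and those tuples can be split into two lists.

```python
import re

text = 'bla word1="word2" blabla abc="xyz" bla bla'

# Group 1: the key directly left of '='.
# Group 2: the value between the double quotes (quotes excluded).
pattern = re.compile(r'([^\s=]+)="([^"]*)"')

pairs = pattern.findall(text)      # list of (key, value) tuples
keys = [k for k, _ in pairs]       # words left of '='
values = [v for _, v in pairs]     # words right of '=', without quotes

print(keys)
print(values)
```

With the example string from the question, keys holds word1 and abc, and values holds word2 and xyz.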

CodePudding user response:

This should do what you want:

(?: (\w*)=)(?:\"(\w*)\")

This is for a Python regex.

You can see it working here.

CodePudding user response:

If you are looking for lhs and rhs from lhs="rhs", this should work (sorry, this is what I understood from your question):

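import re

test_str = 'abc="def" ghi'
# Raw string keeps the backslashes literal; \w+ matches one or more word chars.
ans = re.search(r'(\w+)="(\w+)"', test_str)
print(ans.group(1))  # abc
print(ans.group(2))  # def
my_list = list(ans.groups())
print(my_list)  # ['abc', 'def']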
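
Note that re.search stops at the first match, so the snippet above returns only one pair. A minimal sketch that collects every pair into two lists, using re.finditer on the example string from the question (the names lhs_list and rhs_list are illustrative, not from the original answer):

```python
import re

test_str = 'bla word1="word2" blabla abc="xyz" bla bla'

lhs_list = []  # words left of '='
rhs_list = []  # words right of '=', without quotes

# finditer yields one match object per non-overlapping occurrence.
for m in re.finditer(r'(\w+)="(\w+)"', test_str):
    lhs_list.append(m.group(1))
    rhs_list.append(m.group(2))

print(lhs_list)
print(rhs_list)
```

Because this uses \w+, keys or values containing characters such as hyphens would not be matched in full.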