Divide text in different groups using regex

Time:02-14

I want to extract some data using regex. I'm currently using Python, and the data can look like this:

text [text1 & text2]
text [text1]
text [text2]
text

There is a "text" value (one or more words) that may or may not be followed by []. When the brackets are present, they contain either two words separated by & or just one of them. (At the moment there are only two possible words, but I think it is better to split on the separator than to match the exact words.)

So my goal is to get the data in two different groups:

  • Group1: text
  • Group2: [text1, text2] (if present)

My regex is:

(^([^\[])+)(.+?(?=&|\]))?(.+?(?=\]))?

I've tried this using findall which I think fits better with the result I want:

regex = r"(^([^\[])+)(.+?(?=&|\]))?(.+?(?=\]))?"
result = re.findall(regex, text)

But it returns:

[('text ', ' ', '[text1 ', '& text2')]

And I want something like:

[('text'), ('text1', 'text2')]

I've tried it in this playground.

How can I improve my regex pattern?

Thanks in advance.

CodePudding user response:

Try this regex:

(?x)            # Extended (whitespace and comments are ignored)
([^[]+)         # Group 1: 1 or more non-'[' characters
(?:             # Start of a non-capturing group 1
   \s*          # Match optional whitespace
   \[           # Match '['
   ([^&\]]+)    # Group 2: 1 or more non- '&' or ']' characters
   (?:          # Start of a non-capturing group 2
      \s*       # Match optional whitespace
      &         # Match '&'
      \s*       # Match optional whitespace
      ([^\]]+)  # Capture group 3: 1 or more non-']' characters
      \s*       # Match optional whitespace
   )*           # End of optional non-capturing group 2
   ]            # Match ']'
)*              # End of optional non-capturing group 1

See Regex Demo

import re

tests = [
    'text [text1 & text2]',
    'text [text1]',
    'text [text2]',
    'text'
]

regex = r"""(?x)# Extended (whitespace and comments are ignored)
([^[]+)         # Group 1: 1 or more non-'[' characters
(?:             # Start of a non-capturing group 1
   \s*          # Match optional whitespace
   \[           # Match '['
   ([^&\]]+)    # Group 2: 1 or more non- '&' or ']' characters
   (?:          # Start of a non-capturing group 2
      \s*       # Match optional whitespace
      &         # Match '&'
      \s*       # Match optional whitespace
      ([^\]]+)  # Capture group 3: 1 or more non-']' characters
      \s*       # Match optional whitespace
   )*           # End of optional non-capturing group 2
   ]            # Match ']'
)*              # End of optional non-capturing group 1
"""
for test in tests:
    # Here I am doing a fullmatch, but you can do a search to be more lenient:
    m = re.fullmatch(regex, test)
    if m:
        results = m[1], m[2], m[3]
        print(test, '->', results)

Prints:

text [text1 & text2] -> ('text ', 'text1 ', 'text2')
text [text1] -> ('text ', 'text1', None)
text [text2] -> ('text ', 'text2', None)
text -> ('text', None, None)

You can, of course, group the results any way you want, such as:

results = (m[1], (m[2] or '', m[3] or ''))

Prints:

text [text1 & text2] -> ('text ', ('text1 ', 'text2'))
text [text1] -> ('text ', ('text1', ''))
text [text2] -> ('text ', ('text2', ''))
text -> ('text', ('', ''))
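
If the goal is the shape from the question (a label plus a list that is simply shorter when values are missing), one option is to strip and filter the groups in Python after matching. The following is only a sketch: the one-line pattern is my compact restatement of the verbose regex above, and `parse_line` is a hypothetical helper name, not something from the answer.

```python
import re

# Compact form of the verbose pattern above (assumed equivalent; '[' is escaped
# inside the character class purely for readability).
PATTERN = re.compile(r"([^\[]+)(?:\s*\[([^&\]]+)(?:\s*&\s*([^\]]+)\s*)*\])*")


def parse_line(line):
    """Return (label, values), or None when the line does not match."""
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    label = m[1].strip()  # drop the trailing space captured before '['
    # Keep only the groups that took part in the match, trimmed.
    values = [v.strip() for v in (m[2], m[3]) if v is not None]
    return label, values


for test in ['text [text1 & text2]', 'text [text1]', 'text [text2]', 'text']:
    print(test, '->', parse_line(test))
```

With this approach a line without brackets yields an empty list rather than a pair of empty strings, which may be easier to iterate over.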

Update

This regex will do a better job of automatically trimming whitespace if you know there is always whitespace before the '[' and '&' characters:

(?x)            # Extended (whitespace and comments are ignored)
([^[]+)         # Group 1: 1 or more non-'[' characters
(?:             # Start of a non-capturing group 1
   \s           # Match whitespace
   \[           # Match '['
   ([^&\]]+)    # Group 2: 1 or more non- '&' or ']' characters
   (?:          # Start of a non-capturing group 2
      \s        # Match whitespace
      &         # Match '&'
      \s*       # Match optional whitespace
      ([^\]]+)  # Capture group 3: 1 or more non-']' characters
      \s*       # Match optional whitespace
   )*           # End of optional non-capturing group 2
   ]            # Match ']'
)*              # End of optional non-capturing group 1

Code:

import re

tests = [
    'text [text1 & text2]',
    'text [text1]',
    'text [text2]',
    'text'
]

regex = r"""(?x)# Extended (whitespace and comments are ignored)
([^[]+)         # Group 1: 1 or more non-'[' characters
(?:             # Start of a non-capturing group 1
   \s           # Match whitespace
   \[           # Match '['
   ([^&\]]+)    # Group 2: 1 or more non- '&' or ']' characters
   (?:          # Start of a non-capturing group 2
      \s        # Match whitespace
      &         # Match '&'
      \s*       # Match optional whitespace
      ([^\]]+)  # Capture group 3: 1 or more non-']' characters
      \s*       # Match optional whitespace
   )*           # End of optional non-capturing group 2
   ]            # Match ']'
)*              # End of optional non-capturing group 1
"""
for test in tests:
    # Here I am doing a fullmatch, but you can do a search to be more lenient:
    m = re.fullmatch(regex, test)
    if m:
        results = (m[1], (m[2] or '', m[3] or ''))
        print(test, '->', results)

Prints:

text [text1 & text2] -> ('text', ('text1', 'text2'))
text [text1] -> ('text', ('text1', ''))
text [text2] -> ('text', ('text2', ''))
text -> ('text', ('', ''))
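
One caveat if the bracket may someday hold more than two values: in Python's `re`, a capturing group that sits inside a repeated group retains only the text from its last repetition, so a fixed set of groups cannot return an arbitrary number of items. A minimal demonstration, using a made-up input string and a simplified pattern that is not the one above:

```python
import re

sample = "a & b & c"  # hypothetical input with three '&'-separated items

# The capturing group (\w+) is repeated three times, but only the
# final repetition is still available after the match.
m = re.fullmatch(r"(?:(\w+)\s*&?\s*)+", sample)
last = m[1]
print(last)   # 'c'

# Collecting every item needs findall (or a split) instead of one group.
items = re.findall(r"\w+", sample)
print(items)  # ['a', 'b', 'c']
```

This is why splitting the bracket contents on the separator, as the question suggests, tends to scale better than adding one capture group per expected value.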

CodePudding user response:

I'm not familiar with Python, but I don't think it's possible to get more than one value out of a single group in a regular expression, so my guess is that you'd have to use something like this:

(\w+)(?: \[(.*)\])?

This gives you one or two groups:

[('text'), ('text1 & text2')]
[('text'), ('text1')]
[('text'), ('text2')]
[('text')]

And then if the second group exists, use a second regular expression to get each individual value from it.

You can have a look here: https://regex101.com/r/DxF5p1/1
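
In Python, the two-step idea described above might look roughly like the sketch below. It assumes the pattern is `(\w+)(?: \[(.*)\])?` and uses `re.split` for the second step; `split_groups` is a hypothetical name chosen for illustration.

```python
import re

# Step 1: the label, then optionally everything between '[' and ']'.
# Note: \w+ matches a single word only, so a multi-word label would need
# a broader first group.
OUTER = re.compile(r"(\w+)(?: \[(.*)\])?")


def split_groups(line):
    """Return (label, values), or None when the line does not match."""
    m = OUTER.fullmatch(line)
    if m is None:
        return None
    label, inner = m.groups()
    if inner is None:  # no brackets on this line
        return label, []
    # Step 2: split the bracket contents on '&', ignoring surrounding spaces.
    values = re.split(r"\s*&\s*", inner.strip())
    return label, values


for test in ['text [text1 & text2]', 'text [text1]', 'text [text2]', 'text']:
    print(test, '->', split_groups(test))
```

Because the second step is a split, the same code would also handle three or more values inside the brackets without changing the pattern.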
