How to get a list of substrings matching a given regex

Time:01-27

I am attempting to create a list of substrings matching a regex from a string.

import re

entry = '''
# [control statements](#main-header)3
Describe your writing briefly here, what is the tutorial, how many people are you looking for?

## [4.1. if Statements](#some-header-title1)
accepts one required argument (voltage) and three optional arguments (state, action, and type). This function can be called in any of the following ways
accepts one required argument (voltage) and three optional arguments (state, action, and type). This function can be called in any of the following ways
## [4.2. for Statements](#some-header-title2)
When a final formal parameter of the form **name is present, it receives a dictionary (see Mapping Types — dict) containing all keyword arguments except for those corresponding to a formal parameter. This may be combined with a formal parameter of the form *name (described in the next subsection) which receives a tuple containing the positional arguments beyond the formal parameter list. (*name must occur before **name.) For example, if we define a function like this
## [The range() Function](#some-header-title3)
In many ways the object returned by range() behaves as if it is a list, but in fact it isn’t. It is an object which returns the successive items of the desired sequence when you iterate over it, but it doesn’t really make the list, thus saving space.
# heloo)
We say such an object is iterable, that is, suitable as a target for functions and constructs that expect something from which they can obtain successive items until the supply is exhausted. We have seen that the for statement is such a construct, while an example of a function that takes an iterable is 
'''
pattern = r"#.*\)"  # raw string, so \) is passed to the regex engine unchanged
subs_list = re.findall(pattern, entry)
print(subs_list)

The above gives me the following list:

 [
     '# [control statements](#main-header)', 
     '## [4.1. if Statements](#some-header-title1)', 
     '## [4.2. for Statements](#some-header-title2)', 
     '## [The range() Function](#some-header-title3)', 
      '# heloo)'
    ]

Instead, what I want is this list, without the '# heloo)' entry:

  [
     '# [control statements](#main-header)',
     '## [4.1. if Statements](#some-header-title1)',
     '## [4.2. for Statements](#some-header-title2)',
     '## [The range() Function](#some-header-title3)'
   ]

Given that these sections will always vary between entries, what would be a better regex pattern to match only the section titles that contain an id in parentheses, such as # [control statements](#main-header)? Thanks

CodePudding user response:

Require one or two # characters, an optional whitespace character, and then a mandatory [ at the start of the match:

pattern = r"##?\s?\[.*\)"
subs_list = re.findall(pattern, entry)
print(subs_list)

['# [control statements](#main-header)',
 '## [4.1. if Statements](#some-header-title1)',
 '## [4.2. for Statements](#some-header-title2)',
 '## [The range() Function](#some-header-title3)']
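
A slightly stricter variant is sketched below. It is an assumption on my part, not part of the answer above: it anchors each match at the start of a line with re.MULTILINE and requires the full [title](#id) shape. The `sample` text and the `HEADING_RE` name are hypothetical stand-ins for illustration.

```python
import re

# Shortened, hypothetical stand-in for the `entry` string in the question.
sample = """
# [control statements](#main-header)3
Some body text that mentions range() in passing.
## [4.1. if Statements](#some-header-title1)
## [The range() Function](#some-header-title3)
# heloo)
"""

# ^#+         one or more hashes at the start of a line (needs re.MULTILINE)
# \s*         optional whitespace after the hashes
# \[.*\]      the bracketed title; greedy, so "range()" inside it is fine
# \(#[^)]*\)  the parenthesised id, which must begin with '#'
HEADING_RE = re.compile(r"^#+\s*\[.*\]\(#[^)]*\)", re.MULTILINE)

headings = HEADING_RE.findall(sample)
print(headings)
# ['# [control statements](#main-header)',
#  '## [4.1. if Statements](#some-header-title1)',
#  '## [The range() Function](#some-header-title3)']
```

Anchoring with ^ also keeps a stray # in the middle of a paragraph from starting a match, which the unanchored patterns would allow.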

CodePudding user response:

Seems like an easy fix.

From the data, you have a pattern: the parts you want all have a [ right after the leading hashes, which means you can add that to the filter.

Regex: # \[.*\)

This should exclude the last one, since it doesn't have the [ character after the #.

Check it at regex101
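
The number of leading hashes is worth checking when using re.findall. The sketch below compares a pattern that names exactly one # before the space with a `#+` variant. The variant is my assumption, not something stated in the answer, and `line` is a single heading taken from the question's data.

```python
import re

line = "## [4.2. for Statements](#some-header-title2)"

# Exactly one '#' followed by a space: the match can only begin
# at the second hash of a '##' heading.
single_hash = re.findall(r"# \[.*\)", line)

# One or more '#': the match begins at the first hash.
all_hashes = re.findall(r"#+ \[.*\)", line)

print(single_hash)  # ['# [4.2. for Statements](#some-header-title2)']
print(all_hashes)   # ['## [4.2. for Statements](#some-header-title2)']
```

Both patterns skip '# heloo)', but only the `#+` form keeps the heading level intact in the returned strings.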
