Suppose I have the following text:
test1 = '\n\nDisclaimer ...........................\t10\n\nITOM - IT Object Model ...............\t11\n\nDB – Datenbank Model..................\t11\n\nDB - Datenbank Model - Views .........\t12'
which looks like:
Disclaimer ........................... 10
ITOM - IT Object Model ............... 11
DB – Datenbank Model.................. 11
DB - Datenbank Model - Views ......... 12
I want to make a list of the contents such that I get:
['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views' ]
so I do the following:
re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)
which returns:
['Disclaimer', 'ITOM', 'DB', 'DB']
I wonder why my regex doesn't pick up the words after the - ?
CodePudding user response:
You can use a regex and a non-regex approach here:
[line.split('...')[0].strip() for line in test1.splitlines() if line.strip()]
[re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in test1.splitlines() if line.strip()]
re.findall(r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$', test1, re.M)
Notes:
- The text is split into separate lines
- Drop the line if it is blank
- Either split the line on triple dots and take the first chunk
- Or, if you prefer regex, remove the trailing run of dots, the optional whitespace, the digits, and any trailing whitespace.
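As a rough sketch, the two line-by-line variants from the notes above could be run like this. It assumes the regex uses + quantifiers (one or more dots, one or more digits), and the names lines, by_split and by_sub are only illustrative:

```python
import re

test1 = ('\n\nDisclaimer ...........................\t10'
         '\n\nITOM - IT Object Model ...............\t11'
         '\n\nDB – Datenbank Model..................\t11'
         '\n\nDB - Datenbank Model - Views .........\t12')

# Split the text into lines and drop the blank ones.
lines = [line for line in test1.splitlines() if line.strip()]

# Non-regex variant: cut each line at the first run of three dots, then trim.
by_split = [line.split('...')[0].strip() for line in lines]

# Regex variant: remove the trailing "dots, whitespace, digits" tail of each line.
by_sub = [re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in lines]

print(by_split)
# ['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views']
print(by_sub == by_split)
# True
```

Note that the split variant would also cut a title that itself contains three dots, so the regex variant is probably the safer choice for less regular input.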
Or, if you prefer the fully-regex approach (see the third line of code in the above snippet), you can use re.findall with a ^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$ pattern:
^ - start of a line
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
[^\S\n]* - zero or more horizontal whitespaces
\.+ - one or more dots
[^\S\n]* - zero or more horizontal whitespaces
\d+ - one or more digits
[^\S\n]* - zero or more horizontal whitespaces
$ - end of line.
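To see the difference in practice, here is a minimal sketch that runs the pattern from the question next to the fully-regex pattern. The variable names are illustrative, and the + quantifiers are assumed from the "one or more" wording in the breakdown above. The question's pattern stops early because \S* only matches non-whitespace characters, so each match ends at the first space:

```python
import re

test1 = ('\n\nDisclaimer ...........................\t10'
         '\n\nITOM - IT Object Model ...............\t11'
         '\n\nDB – Datenbank Model..................\t11'
         '\n\nDB - Datenbank Model - Views .........\t12')

# Pattern from the question: one allowed first char, then non-whitespace only.
question_pattern = r'^[a-zA-Z\%\$\#\@\!\-\_]\S*'
first_words = re.findall(question_pattern, test1, re.MULTILINE)

# Fully-regex variant: capture lazily up to the dotted leader and page number.
toc_pattern = r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$'
titles = re.findall(toc_pattern, test1, re.M)

print(first_words)
# ['Disclaimer', 'ITOM', 'DB', 'DB']
print(titles)
# ['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views']
```

Blank lines produce no match here because the pattern requires at least one dot and one digit on the same line.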
CodePudding user response:
I'm proposing an alternate approach, with a different regex. Replace the unwanted characters, instead of finding the needed ones, as it seems easy for your case.
See below:
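contents = [line for line in re.sub(r"\s?(\.)+\s+(\d)+\b", "", test1, flags=re.MULTILINE).splitlines() if line]
This produces the list of contents you want. The flag is passed as flags= because the fourth positional argument of re.sub is count, and the if line filter drops the empty strings left by the blank lines in the input.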
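One caveat with re.sub that is easy to miss: its fourth positional parameter is count, not flags, so a flag passed in that position silently limits the number of replacements. The sketch below illustrates this on a made-up ten-entry table of contents (toc is hypothetical data, not from the question):

```python
import re

# Hypothetical ten-entry table of contents, built only for illustration.
toc = '\n'.join(f'Chapter {i} ......\t{i}' for i in range(1, 11))
pattern = r'\s?\.+\s+\d+\b'

# re.MULTILINE has the integer value 8; used as count it caps the
# substitution at eight replacements (same effect as passing it positionally).
capped = re.sub(pattern, '', toc, count=int(re.MULTILINE)).splitlines()

# Passing the flag by keyword leaves count at its default (replace all).
full = re.sub(pattern, '', toc, flags=re.MULTILINE).splitlines()

print(capped[-2:])
# ['Chapter 9 ......\t9', 'Chapter 10 ......\t10']
print(full[-2:])
# ['Chapter 9', 'Chapter 10']
```

With only four entries, as in the question, the cap of eight is never reached, which is why the problem can go unnoticed.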