Picking words after dash in regex

Suppose I have the following text:

test1 = '\n\nDisclaimer ...........................\t10\n\nITOM - IT Object Model ...............\t11\n\nDB – Datenbank Model..................\t11\n\nDB - Datenbank Model - Views .........\t12'

which looks like:

Disclaimer ...........................  10

ITOM - IT Object Model ...............  11

DB – Datenbank Model..................  11

DB - Datenbank Model - Views .........  12

I want to make a list of the contents such that I get:

['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views' ]

so I do the following:

re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)

which returns:

['Disclaimer', 'ITOM', 'DB', 'DB']

Why doesn't my regex pick up the words after the -?

CodePudding user response：

You can use either a non-regex or a regex approach here:

[line.split('...')[0].strip() for line in test1.splitlines() if line.strip()]
[re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in test1.splitlines() if line.strip()]
re.findall(r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$', test1, re.M)


Notes:

The text is split into separate lines.
Blank lines are dropped.
Then either split each line on three dots and keep the first chunk,
or, if you prefer regex, remove the run of dots together with any whitespace around it, the digits that follow, and any trailing whitespace.
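
The two line-by-line variants described in these notes can be sketched as a small runnable script. This is only an illustration: the function names titles_by_split and titles_by_sub are made up here, and the sample string is the one from the question.

```python
import re

# Sample input copied from the question.
test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

def titles_by_split(text):
    """Keep whatever precedes the first run of three dots on each non-blank line."""
    return [line.split('...')[0].strip() for line in text.splitlines() if line.strip()]

def titles_by_sub(text):
    """Strip the trailing dots and page number from each non-blank line."""
    return [re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in text.splitlines() if line.strip()]

print(titles_by_split(test1))
print(titles_by_sub(test1))
```

One difference worth keeping in mind: the split variant would cut a title short if the title itself contained three consecutive dots, whereas the re.sub variant only touches the dots at the end of the line.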

Or, if you prefer a fully regex approach (the re.findall call in the snippet at the top of this answer), you can use re.findall with the ^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$ pattern:

^ - start of a line
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
[^\S\n]* - zero or more horizontal whitespaces
\.+ - one or more dots
[^\S\n]* - zero or more horizontal whitespaces
\d+ - one or more digits
[^\S\n]* - zero or more horizontal whitespaces
$ - end of line.
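
As a rough sketch of how this breakdown maps onto code, the same pattern can be written with re.VERBOSE so each piece carries its own comment. The name TOC_LINE is made up for this example. The question's pattern is included for comparison: \S* matches only non-whitespace characters, so it stops at the first space, while .*? in Group 1 is allowed to cross spaces.

```python
import re

# Sample input copied from the question.
test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

# The question's pattern: \S* cannot cross a space, so each match ends after the first word.
original = re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)

# The pattern from this answer, spelled out piece by piece.
TOC_LINE = re.compile(r'''
    ^            # start of a line
    (.*?)        # Group 1: the title, as few chars as possible, spaces allowed
    [^\S\n]*     # zero or more horizontal whitespaces
    \.+          # one or more dots
    [^\S\n]*     # zero or more horizontal whitespaces
    \d+          # one or more digits
    [^\S\n]*     # zero or more horizontal whitespaces
    $            # end of the line
''', re.MULTILINE | re.VERBOSE)

titles = TOC_LINE.findall(test1)

print(original)
print(titles)
```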


CodePudding user response：

I'm proposing an alternative approach with a different regex: remove the unwanted characters instead of matching the ones you need, which seems easier in your case.

See below:

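contents = re.sub(r"\s?(\.)+\s+(\d)+\b", "", test1, flags=re.MULTILINE).splitlines(keepends=False)

This removes the dots and page numbers and splits the result into lines. The blank lines in the input stay in the list as empty strings, so filter them out if you only want the titles.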
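
A minimal runnable sketch of this substitution approach, with the empty strings filtered out, might look like the following. The helper name strip_leaders is made up for illustration, and the sample string is the one from the question.

```python
import re

# Sample input copied from the question.
test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

def strip_leaders(text):
    # Delete each "dots, whitespace, digits" run. The pattern has no ^ or $,
    # so no flags are needed here.
    cleaned = re.sub(r'\s?\.+\s+\d+\b', '', text)
    # Blank lines survive the substitution, so drop the empty strings.
    return [line for line in cleaned.splitlines() if line]

contents = strip_leaders(test1)
print(contents)
```

If flags are ever needed with re.sub, they should be passed as flags=..., because the fourth positional parameter of re.sub is count.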