Picking words after dash in regex

Time:11-24

Suppose I have the following text:

test1 = '\n\nDisclaimer ...........................\t10\n\nITOM - IT Object Model ...............\t11\n\nDB – Datenbank Model..................\t11\n\nDB - Datenbank Model - Views .........\t12'

which looks like:

Disclaimer ...........................  10

ITOM - IT Object Model ...............  11

DB – Datenbank Model..................  11

DB - Datenbank Model - Views .........  12

I want to make a list of the contents such that I get:

['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views' ]

so I do the following:

re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)

which returns:

['Disclaimer', 'ITOM', 'DB', 'DB']

Why doesn't my regex pick up the words after the dash?

CodePudding user response:

You can use either a non-regex or a regex approach here:

[line.split('...')[0].strip() for line in test1.splitlines() if line.strip()]
[re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in test1.splitlines() if line.strip()]
re.findall(r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$', test1, re.M)

See the Python demo.

Notes:

  • The text is split into separate lines
  • Drop the line if it is blank
  • Either split the line on a triple dot and keep the first chunk, stripped of surrounding whitespace
  • Or, if you prefer regex, remove the run of dots at the end of the line together with any surrounding whitespace, the digits that follow, and any trailing whitespace.
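The two line-by-line variants from the notes above can be sketched as a runnable script. This is only an illustration, assuming the `test1` string from the question; the names `lines`, `by_split` and `by_sub` are made up for the example.

```python
import re

# The sample text from the question (note the en dash in the third entry).
test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

# Split into lines and drop the blank ones.
lines = [line for line in test1.splitlines() if line.strip()]

# Non-regex variant: keep everything before the first '...' and strip it.
by_split = [line.split('...')[0].strip() for line in lines]

# Regex variant: remove the dot leader, the page number and any
# whitespace around them at the end of each line.
by_sub = [re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in lines]

print(by_split)
print(by_sub)
```

Both lists should contain the four titles with the dashes intact. The split variant relies on every entry having at least three consecutive dots, which holds for the sample but may not hold for other tables of contents.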

Or, if you prefer the fully-regex approach (see the third line of code in the above snippet), you can use re.findall with a ^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$ pattern:

  • ^ - start of a line
  • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
  • [^\S\n]* - zero or more horizontal whitespaces
  • \.+ - one or more dots
  • [^\S\n]* - zero or more horizontal whitespaces
  • \d+ - one or more digits
  • [^\S\n]* - zero or more horizontal whitespaces
  • $ - end of line.
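The pattern breakdown above can be tried out directly. A minimal sketch, again assuming the `test1` string from the question; `pattern` and `titles` are names chosen for the example.

```python
import re

test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

# Group 1 lazily captures the title; the rest of the pattern consumes the
# dot leader, the page number and horizontal whitespace up to end of line.
pattern = r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$'

# re.M makes ^ and $ match at every line boundary, not only at the
# start and end of the whole string.
titles = re.findall(pattern, test1, re.M)
print(titles)
```

Because the pattern has exactly one capturing group, `re.findall` returns a list of the group's contents rather than the full matches. Blank lines produce no match, since `\.+` requires at least one dot.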

See the regex demo.
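As for why the original pattern stops after the first word: `\S*` matches only non-whitespace characters, so each match ends at the first space, which comes before the dash. A small sketch that reproduces this, assuming the question's `test1` string (the variable names are illustrative):

```python
import re

test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

# The pattern from the question: one allowed first character, then \S*.
original = r'^[a-zA-Z\%\$\#\@\!\-\_]\S*'
found = re.findall(original, test1, re.MULTILINE)
print(found)  # ['Disclaimer', 'ITOM', 'DB', 'DB']

# \S* on its own shows the same behaviour: it cannot cross the space
# that separates 'ITOM' from the dash.
first_token = re.match(r'\S*', 'ITOM - IT Object Model').group()
print(first_token)  # ITOM
```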

CodePudding user response:

I'm proposing an alternate approach, with a different regex. Replace the unwanted characters, instead of finding the needed ones, as it seems easy for your case.

See below:

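contents = re.sub(r"\s?(\.)+\s+(\d)+\b", "", text, flags=re.MULTILINE).splitlines(keepends=False)

This produces the list of contents you want, plus an empty string for each blank line in the input, which you can filter out. Note that the flag has to be passed as flags=re.MULTILINE: the fourth positional argument of re.sub is count, not flags.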
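A runnable sketch of this substitution approach, assuming the `test1` string from the question as input. The filtering of empty strings is an addition for illustration, and the flag is left out here because the pattern contains no `^` or `$` anchors for `re.MULTILINE` to affect.

```python
import re

test1 = (
    '\n\nDisclaimer ...........................\t10'
    '\n\nITOM - IT Object Model ...............\t11'
    '\n\nDB – Datenbank Model..................\t11'
    '\n\nDB - Datenbank Model - Views .........\t12'
)

# Remove an optional whitespace character, the dot leader, the whitespace
# after it and the page number from the whole text in one pass.
cleaned = re.sub(r"\s?\.+\s+\d+\b", "", test1)

# The blank lines of the input survive as empty strings; drop them.
contents = [line for line in cleaned.splitlines() if line]
print(contents)
```

Unlike the line-by-line variants, this pattern requires at least one whitespace character between the dots and the number, which the sample provides as a tab.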
