Python regex is not picking the right pattern

I am trying to use regex to identify certain patterns. Here is (part of) the text.

 1. 9000642612 COMPANY NAME  Master facilities PERSONAL LINE LIMIT 100000000 NKF Reference Product name  This limit can be used for.
2, 3667707  Arsenal NV Master facilities PERSONAL LINE LIMIT 1500000000 PND  PERSONAL LINE LIMIT 2250000000 EUR
3, 3667707 COMPANY NAME NV  Master facilities  PERSONAL LINE LIMIT 3833333360 EUR
4, 2664275  MAN UTD  S81Zt Master facilities  Company line limit 2800000000 EUR COMPANY LIMIT 4200000000 EUR 
5, 2664275-06 TRAVIX TRAVEL AGENCY Sub facility voor2664275 referencing This limit

The text continues beyond this. I would like to get the following groups as the end result:

 1. (9000642612)(COMPANY NAME)(Master facilities)(PERSONAL LINE LIMIT)(100000000)(NKF)
 2. (3667707)(Arsenal NV)(Master facilities)(PERSONAL LINE LIMIT)(1500000000)(PND)(PERSONAL LINE LIMIT)(2250000000)(EUR)
 3. (3667707)(COMPANY NAME NV)(Master facilities)(PERSONAL LINE LIMIT)(3833333360)(EUR)
 4. (2664275)(MAN UTD)(Master facilities)(PERSONAL LINE LIMIT)(2800000000)(EUR)(COMPANY LIMIT)(4200000000)(EUR)
 5. (2664275-06)(TRAVIX TRAVEL AGENCY)(Sub facility voor2664275)()()()

My regex looks like this: (\s[3|9]\d+-?\w+)(\s+?\w+\w+\w+)(\s?\s?)([a-zA-Z -]+). Most of the time it is too generic and captures unwanted text. Any suggestions or help would be highly appreciated!

CodePudding user response：

IF (a big if) you could somehow mark the "standard" texts ("Master facilities", "Personal line limit", etc), for instance by enclosing them in a pair of marker characters, let's say "#", then something could be done:

>>> import re
>>> pat = r'(\d+\-?\d{0,})|(\s[A-Z]{3}\s)|(voor\d+)|(\w+\s{0,})+|(#(\w+\s?)+#)'

>>> s='9000642612 COMPANY NAME  #Master facilities# #PERSONAL LINE LIMIT# 100000000 NKF Reference Product name  This limit can be used for.'
>>> for m in re.finditer(pat,s):
...     print(m)
<re.Match object; span=(0, 10), match='9000642612'>
<re.Match object; span=(11, 25), match='COMPANY NAME  '>
<re.Match object; span=(25, 44), match='#Master facilities#'>
<re.Match object; span=(45, 66), match='#PERSONAL LINE LIMIT#'>
<re.Match object; span=(67, 76), match='100000000'>
<re.Match object; span=(76, 81), match=' NKF '>
<re.Match object; span=(81, 131), match='Reference Product name  This limit can be used fo>

>>> s='3667707  Arsenal NV #Master facilities# #PERSONAL LINE LIMIT# 1500000000 PND  #PERSONAL LINE LIMIT# 2250000000 EUR'
>>> for m in re.finditer(pat,s):
...     print(m)
<re.Match object; span=(0, 7), match='3667707'>
<re.Match object; span=(9, 20), match='Arsenal NV '>
<re.Match object; span=(20, 39), match='#Master facilities#'>
<re.Match object; span=(40, 61), match='#PERSONAL LINE LIMIT#'>
<re.Match object; span=(62, 72), match='1500000000'>
<re.Match object; span=(72, 77), match=' PND '>
<re.Match object; span=(78, 99), match='#PERSONAL LINE LIMIT#'>
<re.Match object; span=(100, 110), match='2250000000'>
<re.Match object; span=(111, 114), match='EUR'>

>>> s='2664275-06 TRAVIX TRAVEL AGENCY #Sub facility# voor2664275 referencing This limit'
>>> for m in re.finditer(pat,s):
...     print(m)
<re.Match object; span=(0, 10), match='2664275-06'>
<re.Match object; span=(11, 32), match='TRAVIX TRAVEL AGENCY '>
<re.Match object; span=(32, 46), match='#Sub facility#'>
<re.Match object; span=(47, 58), match='voor2664275'>
<re.Match object; span=(59, 81), match='referencing This limit'>

You may still have some garbage text in the last match, but I suppose that can be easily disposed of.
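As for the "big if": one possible way to add the markers automatically is sketched below. It assumes the set of standard phrases is small and known in advance; the STANDARD_PHRASES list and the mark_phrases helper are made up for illustration and would need adjusting to the real data.

```python
import re

# Assumed list of "standard" phrases; extend it to match the real data.
STANDARD_PHRASES = [
    "Master facilities",
    "Sub facility",
    "PERSONAL LINE LIMIT",
    "COMPANY LINE LIMIT",
    "COMPANY LIMIT",
]

# Longest phrases first, so a longer phrase wins over a shorter one it contains.
_phrase_re = re.compile(
    "|".join(re.escape(p) for p in sorted(STANDARD_PHRASES, key=len, reverse=True)),
    re.IGNORECASE,
)


def mark_phrases(line):
    """Wrap every known standard phrase in '#' markers (hypothetical helper)."""
    return _phrase_re.sub(lambda m: "#" + m.group(0) + "#", line)


# Same pattern as above.
pat = r'(\d+\-?\d{0,})|(\s[A-Z]{3}\s)|(voor\d+)|(\w+\s{0,})+|(#(\w+\s?)+#)'

line = ("3667707  Arsenal NV Master facilities PERSONAL LINE LIMIT 1500000000 PND  "
        "PERSONAL LINE LIMIT 2250000000 EUR")
marked = mark_phrases(line)
# Strip the surrounding whitespace that some alternatives capture.
tokens = [m.group(0).strip() for m in re.finditer(pat, marked)]
print(tokens)
```

For this line, marked comes out identical to the second hand-marked string above, so the tokens are the same matches with the surrounding whitespace stripped.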

CodePudding user response：

Please clarify what you are trying to do. To me, it looks like you should write a parser to handle these cases. I don't think a single simple regex can cover such diverse cases.

But for some cases it might suffice. For example, this could give you the index of a row (where text is a string containing one row of your text):

re.search(r"([0-9]+)[.,]\s+([0-9,-]+)\s", text).group(1)

And this could give you the first column.

re.search(r"([0-9]+)[.,]\s+([0-9,-]+)\s", text).group(2)
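Note that re.search returns None when nothing matches, so calling .group() directly raises an AttributeError on rows that don't fit. A minimal sketch that fetches both values in one pass and guards against that case (the name row_index_and_id is just for illustration):

```python
import re

# Row index, then '.' or ',', whitespace, then the first column (digits, ',' or '-').
ROW_START = re.compile(r"([0-9]+)[.,]\s+([0-9,-]+)\s")


def row_index_and_id(text):
    """Return (index, first_column) for a row, or None if the row does not match."""
    m = ROW_START.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2)


print(row_index_and_id(
    "5, 2664275-06 TRAVIX TRAVEL AGENCY Sub facility voor2664275 referencing This limit"
))
# ('5', '2664275-06')
```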

I cannot help you write the regex for all columns, as I don't understand your data. Maybe column two is a company name and column three something different? Maybe it can only take a finite number of values?
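If the answer to that last question is yes, a small parser could look roughly like the sketch below. Everything about the data layout here is a guess on my part: that the facility type is either "Master facilities" or "Sub facility" (optionally followed by "voor<number>"), that the company name is whatever sits between the id and the facility type, and that each limit is a known label followed by an amount and a three-letter code. The name parse_row and the label list are made up for illustration.

```python
import re

# Assumed row layout: "<index>[.,] <id> <company name> <facility type> ...".
ROW = re.compile(
    r"^\s*\d+[.,]\s*(?P<id>\d+(?:-\d+)?)\s+(?P<name>.+?)\s+"
    r"(?P<facility>Master facilities|Sub facility(?:\s+voor\d+)?)",
    re.IGNORECASE,
)

# Assumed limit layout: "<known label> <amount> <3-letter code>".
LIMIT = re.compile(
    r"(?P<label>PERSONAL LINE LIMIT|COMPANY LINE LIMIT|COMPANY LIMIT)\s+"
    r"(?P<amount>\d+)\s+(?P<currency>[A-Z]{3})\b",
    re.IGNORECASE,
)


def parse_row(line):
    """Parse one row into a dict, or return None if the row start is not recognised."""
    m = ROW.search(line)
    if m is None:
        return None
    # Only look for limits after the facility type.
    limits = [
        (lim["label"], lim["amount"], lim["currency"])
        for lim in LIMIT.finditer(line, m.end())
    ]
    return {
        "id": m["id"],
        "name": m["name"].strip(),
        "facility": m["facility"],
        "limits": limits,
    }


row = ("2, 3667707  Arsenal NV Master facilities PERSONAL LINE LIMIT 1500000000 PND  "
       "PERSONAL LINE LIMIT 2250000000 EUR")
print(parse_row(row))
```

Rows with extra tokens in the name (such as "S81Zt" in row 4) will keep them in the name field, since nothing in the data as posted tells me how to separate them.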