How to RegEx between values that can be multiline or singleline

Time:11-26

I have a long text from which I need to extract data. I am trying to use RegEx but with little success. I did my research, tried a lot of things, but it is not working.

The pattern should:

  • Find the string: "Adónem számlaszáma: "
  • Return the account number after that
  • Go backwards UNTIL the first 3-digit word
  • Return that 3-digit code
  • Return the text between this code and the first string

Part of the text:

Időszak: 2021.01.01-2021.11.24
/
101 Társasági adó   Adónem számlaszáma: 10032000-01076019
2021.01.01.
Nyitóegyenleg

Pattern used:

*Flags used: global, single line*

(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n

The match is good.

Another part of the text:

-13 000


    101 adónemen többlet:   5 000 Ft
104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868

Same pattern used.

The match is not good.

This is the full file I am working with: samplefile.txt

What am I missing? I am using the lazy quantifier, dot matches newline, etc. Thank you in advance.

Answer:

If you do not need to match across lines, turn off the single line flag (so that . stops at line breaks) and use

\b\d{3}\b\s*(.*?)\s*Adónem számlaszáma: (\S*)

See this regex in action.
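
A rough sketch of why the pattern from the question picks the wrong code, and what the first pattern does instead. Python is an assumption here (the question does not name a regex engine), re.DOTALL stands in for the "single line" flag, the excerpt is retyped with ordinary spaces, and the capture group around \d{3} is added only so that the code is returned as well:

```python
import re

# Second excerpt from the question (ordinary spaces assumed).
TEXT = (
    "-13 000\n"
    "\n"
    "\n"
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
)

# Pattern from the question; re.DOTALL plays the role of the "single line" flag.
QUESTION_RE = re.compile(
    r"(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n", re.DOTALL
)

# First pattern from this answer, without DOTALL. The group around \d{3}
# is an addition for this sketch, not part of the pattern above.
SAME_LINE_RE = re.compile(r"\b(\d{3})\b\s*(.*?)\s*Adónem számlaszáma: (\S*)")

bad = QUESTION_RE.search(TEXT)
good = SAME_LINE_RE.search(TEXT)

print(bad.group(1))   # 101
print(good.groups())  # ('104', 'Általános forgalmi adó', '10032000-01076868')
```

The lazy quantifier only shortens the match from the right. The engine still starts at the leftmost 3-digit number that can lead to a match, and with dot matching newlines that is 101 on the previous line.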

Otherwise, you would need to make sure there are no other 3-digit numbers between a 3-digit number and your fixed string:

\b\d{3}\b\s*((?:(?!\b\d{3}\b)[\s\S])*?)\s*Adónem számlaszáma: (\S*)

See this demo. Let me explain the second pattern as it is more specific:

  • \b\d{3}\b - three digits enclosed with word boundaries
  • \s* - zero or more whitespace chars
  • ((?:(?!\b\d{3}\b)[\s\S])*?) - Group 1: any char ([\s\S]), zero or more repetitions but as few as possible (*?), where no char starts a 3-digit number enclosed with word boundaries
  • Adónem számlaszáma: - a fixed string
  • (\S*) - Group 2: zero or more non-whitespace chars
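
The second pattern could be applied to a whole text along these lines. This is a minimal Python sketch, not a definitive implementation: extract_entries is a hypothetical helper name, the capture group around \d{3} is added so the code is returned too, and the last entry of the sample is invented to show a name split across two lines.

```python
import re

# Second pattern from the answer. The group around \d{3} is an addition
# for this sketch so that the 3-digit code is captured as well.
ENTRY_RE = re.compile(
    r"\b(\d{3})\b\s*((?:(?!\b\d{3}\b)[\s\S])*?)\s*Adónem számlaszáma: (\S*)"
)


def extract_entries(text):
    """Return (code, name, account) tuples; whitespace runs in name are collapsed."""
    return [
        (code, " ".join(name.split()), account)
        for code, name, account in ENTRY_RE.findall(text)
    ]


# Second excerpt from the question, followed by an invented entry
# whose name is broken across two lines.
SAMPLE = (
    "-13 000\n\n\n"
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
    "2021.01.01.\n"
    "101 Társasági\n"
    "adó   Adónem számlaszáma: 10032000-01076019\n"
)

for entry in extract_entries(SAMPLE):
    print(entry)
# ('104', 'Általános forgalmi adó', '10032000-01076868')
# ('101', 'Társasági adó', '10032000-01076019')
```

No DOTALL flag is needed here because [\s\S] already matches line breaks. The "101 adónemen többlet" line is skipped because the lookahead refuses to step over the 000 and 104 that sit between it and the fixed string.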