How to RegEx between values that can be multiline or single-line

I have a long text from which I need to extract data. I am trying to use RegEx but with little success. I did my research, tried a lot of things, but it is not working.

The pattern should:

Find the string: "Adónem számlaszáma: "
Return the account number after that
Go backwards UNTIL the first word consisting of 3 digits
Return that 3-digit code
Return the text between this code and the first string

Part of the text:

Időszak: 2021.01.01-2021.11.24
/
101 Társasági adó   Adónem számlaszáma: 10032000-01076019
2021.01.01.
Nyitóegyenleg

Pattern used:

*Flags used: global, single line*

(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n

The match is good.

Another part of the text:

-13 000


    101 adónemen többlet:   5 000 Ft
104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868

Same pattern used.

The match is not good.

This is the full file I am working with: samplefile.txt

What am I missing? I am using a lazy quantifier, the dot matches newlines, etc. Thank you in advance.

CodePudding user response:

If you do not need to match across lines, turn off the single line (dot matches newline) flag so that .*? stays on one line, and use

\b\d{3}\b\s*(.*?)\s*Adónem számlaszáma: (\S*)

See this regex in action.
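
As a rough illustration, here is a minimal Python sketch that runs the question's original pattern and the line-bound pattern against the second excerpt. Several things here are assumptions and not part of the original post: the choice of Python, re.DOTALL as the equivalent of the "single line" flag, the exact whitespace in the sample string, and the extra capturing group around the 3-digit code (added so the code itself is returned).

```python
import re

# Second excerpt from the question (exact whitespace is approximated).
text = (
    "-13 000\n"
    "\n"
    "\n"
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
)

# Original pattern with the "single line" flag: '.' also matches '\n',
# so the lazy '.*?' may still run across a line break to reach the fixed string.
original = re.compile(
    r"(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n", re.DOTALL
)

# Line-bound pattern, compiled WITHOUT DOTALL so '.*?' cannot cross a line.
# The group around the 3-digit code is an addition for this sketch.
line_bound = re.compile(
    r"(\b\d{3}\b)\s*(.*?)\s*Adónem számlaszáma: (\S*)"
)

bad = original.search(text)
good = line_bound.search(text)

print(bad.group(1))   # code picked up by the original pattern
print(good.groups())  # code, description, account number
```

The regex engine reports the leftmost match, and a lazy quantifier only makes that match as short as possible from its starting point. With dot-matches-newline enabled, the original pattern can therefore start at the 101 on the previous line instead of at 104.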

Otherwise, you would need to make sure there are no other 3-digit numbers between a 3-digit number and your fixed string:

\b\d{3}\b\s*((?:(?!\b\d{3}\b)[\s\S])*?)\s*Adónem számlaszáma: (\S*)

See this demo. Let me explain the second pattern as it is more specific:

\b\d{3}\b - three digits enclosed with word boundaries
\s* - zero or more whitespace characters
((?:(?!\b\d{3}\b)[\s\S])*?) - Group 1: any character ([\s\S], including line breaks), zero or more times but as few as possible (*?), where no matched character may be the start of a 3-digit number enclosed with word boundaries
\s* - zero or more whitespace characters
Adónem számlaszáma: - the fixed string
(\S*) - Group 2: zero or more non-whitespace characters.
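
The second pattern can be tried out with a short Python sketch like the one below. It is only an illustration under a few assumptions: Python is my choice of language, the helper name extract is hypothetical, the whitespace in the sample is approximated, and the 3-digit code is wrapped in an extra capturing group (so the description becomes group 2 and the account number group 3). The wrapped string is an invented variant of the question's data, used only to show a description that spans two lines.

```python
import re

# Tempered pattern from the answer; the group around the code is an addition.
RECORD = re.compile(
    r"(\b\d{3}\b)\s*((?:(?!\b\d{3}\b)[\s\S])*?)\s*Adónem számlaszáma: (\S*)"
)


def extract(text):
    """Return (code, description, account) tuples found in text.

    Whitespace runs inside the description (including line breaks)
    are collapsed to single spaces.
    """
    return [
        (m.group(1), " ".join(m.group(2).split()), m.group(3))
        for m in RECORD.finditer(text)
    ]


# Both excerpts from the question, joined (exact whitespace is approximated).
sample = (
    "Időszak: 2021.01.01-2021.11.24\n"
    "/\n"
    "101 Társasági adó   Adónem számlaszáma: 10032000-01076019\n"
    "2021.01.01.\n"
    "Nyitóegyenleg\n"
    "-13 000\n"
    "\n"
    "\n"
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
)

# Hypothetical variant: the description is wrapped onto a second line.
wrapped = "104 Általános\nforgalmi adó  Adónem számlaszáma: 10032000-01076868\n"

records = extract(sample)
wrapped_records = extract(wrapped)

for code, description, account in records + wrapped_records:
    print(code, description, account, sep=" | ")
```

Because the lookahead refuses to step over another standalone 3-digit number, attempts that start at 000 or at the 101 in "101 adónemen többlet" fail, and the match can only begin at the 3-digit code closest to the fixed string.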