I have a long text from which I need to extract data. I am trying to use RegEx but with little success. I did my research, tried a lot of things, but it is not working.
The pattern should:
- Find the string: "Adónem számlaszáma: "
- Return the account number after that
- Go backwards UNTIL the first word consisting of 3 digits
- Return that 3-digit code
- Return the text between this code and the first string
Part of the text:
Időszak: 2021.01.01-2021.11.24
/
101 Társasági adó Adónem számlaszáma: 10032000-01076019
2021.01.01.
Nyitóegyenleg
Pattern used:
*Flags used: global, single line*
(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n
Match is good:
Another part of the text:
-13 000
101 adónemen többlet: 5 000 Ft
104 Általános forgalmi adó Adónem számlaszáma: 10032000-01076868
Same pattern used.
Match is not good:
This is the full file I am working with: samplefile.txt
What am I missing? I have the lazy quantifier, dot matches newline etc... Thank you in advance.
CodePudding user response:
If you do not need to match across lines, turn off the single line (dot matches newline) flag and you may get it done with
\b\d{3}\b\s*(.*?)\s*Adónem számlaszáma: (\S*)
See this regex in action.
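As a quick check, here is a minimal Python sketch (Python is my choice, the question does not name a language) that runs this first pattern over both excerpts from the question, with the dot not matching newlines. The capturing group around `\d{3}` is my addition so that the 3-digit code is returned as well; the names `PATTERN`, `text` and `matches` are only illustrative.

```python
import re

# First pattern from the answer. The parentheses around \d{3} are an
# addition (assumption) so the 3-digit code is captured too.
# No re.DOTALL here: the dot must NOT match newlines for this variant.
PATTERN = re.compile(r"\b(\d{3})\b\s*(.*?)\s*Adónem számlaszáma: (\S*)")

# Both excerpts from the question, joined into one text.
text = (
    "Időszak: 2021.01.01-2021.11.24\n"
    "/\n"
    "101 Társasági adó Adónem számlaszáma: 10032000-01076019\n"
    "2021.01.01.\n"
    "Nyitóegyenleg\n"
    "-13 000\n"
    "101 adónemen többlet: 5 000 Ft\n"
    "104 Általános forgalmi adó Adónem számlaszáma: 10032000-01076868\n"
)

# findall returns one (code, name, account) tuple per match.
matches = PATTERN.findall(text)
for code, name, account in matches:
    print(code, name, account, sep=" | ")
# 101 | Társasági adó | 10032000-01076019
# 104 | Általános forgalmi adó | 10032000-01076868
```

Because the dot stops at the end of the line, the match can no longer start at the `101` on the line above the `104` entry.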
Otherwise, you would need to make sure there are no other 3-digit numbers between a 3-digit number and your fixed string:
\b\d{3}\b\s*((?:(?!\b\d{3}\b)[\s\S])*?)\s*Adónem számlaszáma: (\S*)
See this demo. Let me explain the second pattern as it is more specific:
- `\b\d{3}\b` - three digits enclosed with word boundaries
- `\s*` - zero or more whitespaces
- `((?:(?!\b\d{3}\b)[\s\S])*?)` - Group 1: any char (`[\s\S]`), zero or more repetitions but as few as possible (`*?`), that does not start a 3-digit number enclosed with word boundaries
- `Adónem számlaszáma: ` - a fixed string
- `(\S*)` - Group 2: zero or more non-whitespace chars.
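To see the difference on the second excerpt from the question, the following Python sketch (again my choice of language) compares the original pattern, run with dot matches newline, against the second pattern. The capturing group around `\d{3}` and the names `original`, `tempered`, `bad` and `good` are my additions for illustration.

```python
import re

# Second excerpt from the question.
text = (
    "-13 000\n"
    "101 adónemen többlet: 5 000 Ft\n"
    "104 Általános forgalmi adó Adónem számlaszáma: 10032000-01076868\n"
)

# Pattern from the question; re.DOTALL corresponds to the "single line" flag.
original = re.compile(
    r"(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n", re.DOTALL
)

# Second pattern from the answer; the group around \d{3} is an addition
# (assumption) so the code is returned as well.
tempered = re.compile(
    r"\b(\d{3})\b\s*((?:(?!\b\d{3}\b)[\s\S])*?)\s*Adónem számlaszáma: (\S*)"
)

bad = original.search(text)
good = tempered.search(text)

print(bad.group(1))   # 101  (starts one line too early)
print(good.groups())
# ('104', 'Általános forgalmi adó', '10032000-01076868')
```

The lazy quantifier only keeps the match short on its right side; the regex engine still takes the leftmost possible start, which is the `101` on the previous line. The negative lookahead in the second pattern forbids another 3-digit number inside Group 1, so that start position fails and the match begins at `104`.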