This is a follow-up to my previous question.
I'm trying to read C source files with Python to extract loaded header files.
The header files are specified between #ifdef TYPEA and #else or #endif. If there is an #else clause, the header files will always be specified before the #else clause.
Let's assume an excerpt of the content of the source looks like this:
source_content = '\n'.join([
'void abc ( int value) {',
' return 5 ** 2.5',
'}',
'',
'abc',
'',
'#ifdef TYPEA', # <---- begin identifier, may contain leading/trailing whitespaces
'some_var_is_set = 42.42;', # <---- I do not need these lines, but it's ok to get them
'// some annoying comment', # <---- I do not need these lines, but it's ok to get them
'#include "some_header.h"', # <---- I want these lines
' #include "some_other_header23.h"', # <---- I want these lines
'// another comment', # <---- I do not need these lines, but it's ok to get them
' some_other_var = 3.159;', # <---- I do not need these lines, but it's ok to get them
'#include"some_header.h"', # <---- I want these lines, even though it is a dupe
' #else ', # optional stop identifier, may contain leading/trailing whitespaces
'double in_fact_int = 5;', # some irrelevant content
' #endif ', # final stop identifier, may contain leading/trailing whitespaces
'',
'#ifdef TYPEB',
' abc = 23.5;',
'#endif',
])
I'd like to extract lines 6-14, between #ifdef TYPEA and #else (or #endif), such that my result is:
desired_match = 'some_var_is_set = 42.42;\n// some annoying comment\n#include "some_header.h"\n #include "some_other_header23.h"\n// another comment\n some_other_var = 3.159;\n#include"some_header.h"'
print(desired_match)
# Out: some_var_is_set = 42.42;
// some annoying comment
#include "some_header.h"
 #include "some_other_header23.h"
// another comment
 some_other_var = 3.159;
#include"some_header.h"
Removing every line that does not contain #include would be nice, but is not required.
My current approach is:
import re
pattern = re.compile(
(
r'(\s*.*)#ifdef(\s+)TYPEA(\s*)'
r'(?P<sources>(#include.*?)(?=((\s*)#else|(\s*)#endif)))'
), re.DOTALL
)
match = re.match(pattern, source_content)
This works fine as long as there are no comments or other statements within the enclosing #ifdef/#else/#endif. But as soon as there are comments etc., no match is returned.
The #include in the pattern is required, since there will be other blocks starting with #ifdef TYPEA which will not contain any include statements, but only one TYPEA block with includes.
Thanks in advance!
CodePudding user response:
You can use the following regex with re.search (note that re.match only returns matches that are found at the start of a string, so re.search is more versatile):
#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))
If you need multiple matches, you can plug this regex into re.findall.
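Since the question mentions that several #ifdef TYPEA blocks may exist and only one of them holds the includes, here is a minimal sketch of the re.findall route. The sample string and the names sample, blocks and blocks_with_includes are made up for illustration. It assumes the blocks are not nested, because the lookahead stops at the first #else or #endif it meets.

```python
import re

# Hypothetical input: two TYPEA blocks, only the second one contains includes.
sample = '\n'.join([
    '#ifdef TYPEA',
    'int flag = 1;',
    '#endif',
    '',
    '#ifdef TYPEA',
    '// a comment',
    '#include "some_header.h"',
    '#else',
    'double x = 5;',
    '#endif',
])

# re.DOTALL lets "." match line breaks, so group 1 can span several lines.
pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)

# With a single capturing group, findall returns the group 1 text of each match.
blocks = pattern.findall(sample)

# Keep only the blocks that actually contain an include directive.
blocks_with_includes = [b for b in blocks if '#include' in b]
print(blocks_with_includes)
```

Filtering after the match keeps the regex itself simple. Requiring #include inside the pattern is also possible, but it makes the expression harder to read.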
See the regex demo. Details:
#ifdef - a fixed string
\s+ - one or more whitespaces
TYPEA - a fixed string
\s* - zero or more whitespaces
(.*?) - Group 1: any zero or more chars, as few as possible (also matches line break chars, as re.DOTALL is used)
(?=\s*#(?:else|endif)) - a positive lookahead that matches a location that is immediately followed by zero or more whitespaces and then either #else or #endif.
See the Python demo, too.
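For completeness, here is a rough sketch of how this could be applied to a shortened version of the sample from the question. The variable names anchored, block and includes are my own, and the final filtering step is just one possible way to do the optional "keep only the #include lines" part.

```python
import re

# Shortened version of the sample input from the question.
source_content = '\n'.join([
    'void abc ( int value) {',
    '}',
    '',
    '#ifdef TYPEA',
    'some_var_is_set = 42.42;',
    '// some annoying comment',
    '#include "some_header.h"',
    ' #include "some_other_header23.h"',
    '#include"some_header.h"',
    ' #else ',
    'double in_fact_int = 5;',
    ' #endif ',
])

pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)

# match() only tries at the start of the string, so it finds nothing here.
anchored = pattern.match(source_content)
print(anchored)

# search() scans the whole string and returns the first TYPEA block.
found = pattern.search(source_content)
block = found.group(1) if found else ''
print(block)

# Optional: keep only the include lines, without surrounding whitespace.
includes = [line.strip() for line in block.splitlines()
            if line.strip().startswith('#include')]
print(includes)
```

Note that search() returns the first TYPEA block only. If an earlier TYPEA block without includes can occur, collecting all blocks with findall and filtering them is probably the safer choice.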