I'm trying to read C source files with Python to extract the header files they include.
The header files are specified between #ifdef TYPEA and #else or #endif. If there is an #else clause, the header files will always be specified before it.
Let's assume an excerpt of the content of the source looks like this:
source_content = '\n'.join([
'#ifdef TYPEZERO',
' int someint = 42;',
'#endif',
'void abc ( int value) {',
' return 5 ** 2.5',
'}',
'',
'abc',
'#ifdef TYPEA', # <---- begin identifier, may contain leading/trailing whitespaces
'#include "some_header.h"', # <---- I want these lines
' #include "some_other_header23.h"', # <---- I want these lines
' #else ', # optional stop identifier, may contain leading/trailing whitespaces
'double in_fact_int = 5;', # some irrelevant content
' #endif ', # final stop identifier, may contain leading/trailing whitespaces
'',
'a = 5',
'#ifdef TYPEB',
' abc = 23.5;',
'#endif',
])
I'd like to extract the lines marked with comments above, excluding #ifdef TYPEA, #else, and #endif, such that my result is:
desired_match = '#include "some_header.h"\n #include "some_other_header23.h"'
print(desired_match)
# Out: #include "some_header.h"
# Out: #include "some_other_header23.h"
Removing the whitespace would be nice, but I'm fine with doing that separately from the regex.
My current approach is:
import re
pattern = re.compile(
r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
re.DOTALL
)
match = re.match(pattern, source_content)
print(match.group())
# Out: #ifdef TYPEZERO
# Out: int someint = 42;
# Out: #endif
# Out: void abc ( int value) {
# Out: return 5 ** 2.5
# Out: }
# Out:
# Out: abc
# Out: #ifdef TYPEA
# Out: #include "some_header.h"
# Out: #include "some_other_header23.h"
This works fine for cutting off at #else or #endif, but as you can see, #ifdef TYPEA and, even worse, all preceding lines are also matched.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I don't get any match at all.
How can I exclude the lines before #ifdef TYPEA, and ideally #ifdef TYPEA itself, to get my desired match?
Thanks in advance!
CodePudding user response:
re.match only matches at the very start of the string, so use re.search instead, and use a group number to pick out the part of the match you need.
pattern = re.compile(
r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
re.DOTALL
)
match = re.search(pattern, source_content)
print(match.group(1))
Does this solve your problem?
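To make the difference between the two calls concrete, here is a minimal sketch. The `sample` string is a shortened, hypothetical stand-in for the question's source_content, and the pattern drops the leading whitespace requirement so that it also works when #ifdef TYPEA is the first line of the file; treat both as assumptions, not as the original code.

```python
import re

# Shortened, hypothetical stand-in for the question's source_content.
sample = '\n'.join([
    'abc',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    ' #include "some_other_header23.h"',
    ' #else ',
    'double in_fact_int = 5;',
    ' #endif ',
])

pattern = re.compile(
    r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#else|\s*#endif)',
    re.DOTALL,
)

# match() only tries position 0, where the text is 'abc', so it fails.
anchored = pattern.match(sample)
# search() scans forward until the pattern fits somewhere in the string.
found = pattern.search(sample)

print(anchored)
print(found.group(1))
```

Note that this pattern would also accept a longer macro name that merely starts with TYPEA; adding \b after TYPEA is one way to rule that out.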
CodePudding user response:
If you only want what's between #ifdef TYPEA and #else or #endif, you can match the whole thing and put a capturing group between those keywords. re.findall will then return just the group:
import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')
Output (blank lines from the captured leading and trailing whitespace omitted):
#include "some_header.h"
#include "some_other_header23.h"
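The question also asked about removing the whitespace. That can be done as a separate step on the captured text, roughly as sketched here. The `block` string is a hypothetical copy of what the capture group holds for the question's input, and the second pattern assumes every header is written as #include "..." with double quotes (angle-bracket includes would need a different pattern).

```python
import re

# Hypothetical captured block, shaped like the group returned for the
# question's input, including the surrounding whitespace.
block = '\n#include "some_header.h"\n #include "some_other_header23.h"\n '

# Strip each line and drop the ones that are empty after stripping.
include_lines = [line.strip() for line in block.splitlines() if line.strip()]

# Pull just the file name out of each #include "..." directive.
header_names = re.findall(r'#include\s*"([^"]+)"', block)

print(include_lines)
print(header_names)
```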
CodePudding user response:
Here's a way using named groups. You had solved most of the problem already.
Notice the change in the regex: the part that follows #ifdef TYPEA is now wrapped in the named group (?P<M>...), so it can be retrieved on its own.
import re
source_content = '\n'.join([
'#ifdef TYPEZERO',
' int someint = 42;',
'#endif',
'void abc ( int value) {',
' return 5 ** 2.5',
'}',
'',
'abc',
'#ifdef TYPEA', # <---- begin identifier, may contain leading/trailing whitespaces
'#include "some_header.h"', # <---- I want these lines
' #include "some_other_header23.h"', # <---- I want these lines
' #else ', # optional stop identifier, may contain leading/trailing whitespaces
'double in_fact_int = 5;', # some irrelevant content
' #endif ', # final stop identifier, may contain leading/trailing whitespaces
'',
'a = 5',
'#ifdef TYPEB',
' abc = 23.5;',
'#endif',
])
pattern = re.compile(
r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
re.DOTALL
)
match = re.match(pattern, source_content)
print(match.group("M"))
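As an alternative to one large regex, the same extraction can be sketched as a line-by-line scan. This is a different technique from the named-group answer above, and the function name `includes_in_block` and its `macro` parameter are hypothetical. It assumes the block contains no nested #ifdef/#endif pairs; a nested #endif would end the scan early.

```python
import re

def includes_in_block(text, macro="TYPEA"):
    """Collect stripped lines between '#ifdef <macro>' and the next #else/#endif."""
    # The start line must name the macro exactly (so TYPEAB does not qualify).
    start = re.compile(r'\s*#\s*ifdef\s+' + re.escape(macro) + r'\s*$')
    # Either directive ends the block; nesting is NOT handled (assumption).
    stop = re.compile(r'\s*#\s*(else|endif)\b')
    collected, inside = [], False
    for line in text.splitlines():
        if not inside:
            inside = bool(start.match(line))
        elif stop.match(line):
            break
        elif line.strip():
            collected.append(line.strip())
    return collected

# Shortened, hypothetical stand-in for the question's source_content.
sample = '\n'.join([
    '#ifdef TYPEZERO',
    ' int someint = 42;',
    '#endif',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    ' #include "some_other_header23.h"',
    ' #else ',
    'double in_fact_int = 5;',
    ' #endif ',
])

headers = includes_in_block(sample)
print('\n'.join(headers))
```

Scanning by line keeps the whitespace handling in one place (the strip() call) and avoids relying on re.DOTALL with lazy matching across the whole file.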