Python regex pattern to find string enclosed in preprocessor directives

I'm trying to read C source files with Python to extract loaded header files.
The header files are specified between #ifdef TYPEA and #else OR #endif. If there is an #else-clause, the header files will always be specified before the #else-clause.

Let's assume an excerpt of the content of the source looks like this:

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

I'd like to extract the lines marked in the comments above, excluding the #ifdef TYPEA, #else and #endif lines themselves, so that my result is:

desired_match = '#include "some_header.h"\n           #include "some_other_header23.h"'

print(desired_match)
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

Removing the whitespace would be nice, but I'm fine with doing that separately from the regex.

My current approach is:

import re

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group())
# Out:    #ifdef TYPEZERO
# Out:      int someint = 42;
# Out:    #endif
# Out:    void abc ( int value) {
# Out:      return 5 ** 2.5
# Out:    }
# Out:    
# Out:    abc
# Out:    #ifdef TYPEA
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

This cuts off correctly at #else or #endif, but as you can see, #ifdef TYPEA and, even worse, all preceding lines are matched as well.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I get no match at all.

How can I exclude the lines before #ifdef TYPEA and possibly also #ifdef TYPEA to get my desired match? Thanks in advance!

CodePudding user response:

re.match only matches at the very start of the string, which is why your pattern needed the leading (\s*.*). Use re.search instead, and use a group number to pick out the part of the match you want.

pattern = re.compile(
    r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
    re.DOTALL
)
match = re.search(pattern, source_content)

print(match.group(1))

Does that solve your problem?
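
To illustrate the difference between the two calls, here is a minimal sketch. It assumes a shortened stand-in for source_content (the name sample is made up for this example) and drops the leading \s+ so the pattern starts directly at #ifdef:

```python
import re

# Shortened stand-in for the question's source_content (an assumption for brevity).
sample = '\n'.join([
    'abc',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#else|\s*#endif)', re.DOTALL)

# match() only tries position 0, where the text is 'abc', so it returns None.
anchored = pattern.match(sample)

# search() scans forward until the pattern matches somewhere in the string.
found = pattern.search(sample)
headers = found.group(1) if found else ''

print(anchored)
print(headers)
```

Here group(1) holds only the two #include lines, because the lookahead stops the lazy (.*?) just before the whitespace that leads up to #else.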

CodePudding user response:

If you only want what's between #ifdef TYPEA and #else or #endif, you can match the whole thing and create a group between those keywords. re.findall will return the group:

import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')

Output:

#include "some_header.h"
           #include "some_other_header23.h"
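
The captured group still carries the surrounding newlines and indentation. Since the question allows cleaning up whitespace as a separate step, a possible follow-up could look like this. The string block is written out by hand here to mirror what the pattern above captures, and the second regex is only an assumption about how the header names might be pulled out (it handles the quoted form of #include, not the angle-bracket form):

```python
import re

# What the group captures for the sample input, written out by hand.
block = (
    '\n#include "some_header.h"'
    '\n           #include "some_other_header23.h"'
    '\n           '
)

# Strip each line and drop the ones that are empty after stripping.
include_lines = [line.strip() for line in block.splitlines() if line.strip()]

# Optionally pull out just the file names between the double quotes.
header_names = re.findall(r'#include\s*"([^"]+)"', block)

print(include_lines)
print(header_names)
```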

CodePudding user response:

Here's a way using named groups. You had solved most of the problem already.

Notice the change in the regex: the part that follows #ifdef TYPEA is now wrapped in the named group (?P<M>...), so it can be retrieved by name.

import re

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group("M"))
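
As a variation on the same idea, the directives can be anchored to line starts with re.MULTILINE, which avoids the greedy prefix group and lets finditer pick up more than one TYPEA block. This is only a sketch under a few assumptions: the sample input is shortened, the names directive_block and includes are made up for the example, and nested #ifdef blocks inside the TYPEA branch are not handled:

```python
import re

# Shortened stand-in for source_content, with extra whitespace around #ifdef TYPEA.
sample = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    '  #ifdef TYPEA  ',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

directive_block = re.compile(
    r'^[ \t]*#ifdef[ \t]+TYPEA[ \t]*\n'  # opening directive on its own line
    r'(.*?)'                             # body, captured lazily
    r'^[ \t]*#(?:else|endif)\b',         # first following #else or #endif line
    re.MULTILINE | re.DOTALL,
)

# Collect the stripped, non-empty body lines of every TYPEA block.
includes = [
    line.strip()
    for m in directive_block.finditer(sample)
    for line in m.group(1).splitlines()
    if line.strip()
]

print(includes)
```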