Python regex pattern to find string enclosed in preprocessor directives

Time:12-07

I'm trying to read C source files with Python to extract loaded header files.
The header files are specified between #ifdef TYPEA and #else OR #endif. If there is an #else-clause, the header files will always be specified before the #else-clause.

Let's assume an excerpt of the content of the source looks like this:

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

I'd like to extract the lines marked in the comments above, excluding the #ifdef TYPEA, #else, and #endif lines themselves, so that my result is:

desired_match = '#include "some_header.h"\n           #include "some_other_header23.h"'

print(desired_match)
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

Removing the leading whitespace would be nice, but I'm fine with doing this separately from the regex.

My current approach is:

import re

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group())
# Out:    #ifdef TYPEZERO
# Out:      int someint = 42;
# Out:    #endif
# Out:    void abc ( int value) {
# Out:      return 5 ** 2.5
# Out:    }
# Out:    
# Out:    abc
# Out:    #ifdef TYPEA
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

This correctly cuts off everything from #else or #endif onward, but as you can see, #ifdef TYPEA and, even worse, all preceding lines are also matched.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I get no match at all.

How can I exclude the lines before #ifdef TYPEA and possibly also #ifdef TYPEA to get my desired match? Thanks in advance!

CodePudding user response:

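You can use re.search instead of re.match and use group numbers to get parts of your regex results. re.match only tries to match at the very start of the string, which is why your pattern needed the leading (\s*.*); re.search scans forward until it finds a match.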

pattern = re.compile(
    r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
    re.DOTALL
)
match = re.search(pattern, source_content)

print(match.group(1))

Does this solve your problem?
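To make the difference concrete, here is a minimal sketch comparing the two calls on a shortened stand-in for the question's source_content. The shortened sample, the simplified pattern without the leading \s+, and the names anchored, found, and headers are assumptions for illustration, not part of the original answer.

```python
import re

# Shortened stand-in for the question's source_content (assumed sample).
source_content = '\n'.join([
    'abc',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#else|\s*#endif)', re.DOTALL)

# match() only tries position 0, where the text is 'abc', so it finds nothing.
anchored = pattern.match(source_content)

# search() scans forward until the pattern matches somewhere in the string.
found = pattern.search(source_content)

# Strip the indentation of each captured line in a separate step.
headers = [line.strip() for line in found.group(1).splitlines()]

print(anchored)
print(headers)
```

Group 1 holds only the text between the opening directive and the first #else or #endif, so nothing before #ifdef TYPEA ends up in the result.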

CodePudding user response:

If you only want what's between #ifdef TYPEA and #else or #endif, you can match the whole thing and create a group between those keywords. re.findall will return the group:

import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')

Output:

#include "some_header.h"
           #include "some_other_header23.h"
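Two things to be aware of with the pattern above: the literal '#ifdef TYPEA' would also match a longer macro name such as TYPEAB, and the captured group keeps the surrounding newlines and indentation. The following is a sketch of a stricter, line-anchored variant, assuming each directive sits on its own line and TYPEA blocks contain no nested #ifdef. The sample text and the names block_pattern and headers are made up for illustration.

```python
import re

# Hypothetical sample with a look-alike macro and two TYPEA blocks.
sample = '\n'.join([
    '#ifdef TYPEAB',
    '#include "not_wanted.h"',
    '#endif',
    '  #ifdef   TYPEA  ',
    '#include "first.h"',
    '#else',
    'int x;',
    '#endif',
    '#ifdef TYPEA',
    '   #include "second.h"',
    '  #endif',
])

block_pattern = re.compile(
    r'^[ \t]*#ifdef[ \t]+TYPEA\b[^\n]*\n'  # opening directive on its own line
    r'(.*?)'                                # block body, captured lazily
    r'^[ \t]*#(?:else|endif)\b',            # first following #else or #endif line
    re.MULTILINE | re.DOTALL,
)

# Flatten all captured bodies into stripped, non-empty lines.
headers = [
    line.strip()
    for body in block_pattern.findall(sample)
    for line in body.splitlines()
    if line.strip()
]

print(headers)
```

Here re.MULTILINE is what lets ^ match at the start of every line, and \b after TYPEA rejects TYPEAB. A nested #ifdef ... #endif inside the block would end the capture early, so this is not a general preprocessor parser.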

CodePudding user response:

Here's a way using named groups. You had solved most of the problem already.

Notice the change in the regex: the part after #ifdef TYPEA is now wrapped in the named group (?P<M>...), so it can be retrieved by name.

import re

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group("M"))
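Since the stated goal is to find the loaded header files, a named group can also pull the file names out of the captured block in a second pass. This is only a sketch: the variable block stands in for the text returned by match.group("M"), and include_pattern, the group name header, and header_names are assumptions introduced here.

```python
import re

# Assumed input: the text captured between '#ifdef TYPEA' and '#else'.
block = '#include "some_header.h"\n           #include "some_other_header23.h"'

include_pattern = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*'  # directive, optionally indented
    r'["<]'                         # opening quote or angle bracket
    r'(?P<header>[^">\n]+)'         # header file name, captured by name
    r'[">]',                        # closing quote or angle bracket
    re.MULTILINE,
)

# Collect the named group from every #include line in the block.
header_names = [m.group('header') for m in include_pattern.finditer(block)]

print(header_names)
```

Because the indentation is consumed by the pattern itself, no separate strip step is needed for the file names.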