Find string enclosed in preprocessor directives - with interfering lines

Time:12-16

This is a follow-up to my previous question.

I'm trying to read C source files with Python to extract loaded header files.
The header files are specified between #ifdef TYPEA and #else OR #endif. If there is an #else-clause, the header files will always be specified before the #else-clause.

Let's assume an excerpt of the content of the source looks like this:

source_content = '\n'.join([
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    'some_var_is_set = 42.42;',     # <---- I do not need these lines, but it's ok to get them
    '// some annoying comment',     # <---- I do not need these lines, but it's ok to get them
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '// another comment',           # <---- I do not need these lines, but it's ok to get them
    '    some_other_var = 3.159;',  # <---- I do not need these lines, but it's ok to get them
    '#include"some_header.h"',                      # <---- I want these lines, even though it is a dupe
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

I'd like to extract the lines between #ifdef TYPEA and #else (or #endif), i.e. items 7-13 of the list above (zero-based), such that my result is:

desired_match = 'some_var_is_set = 42.42;\n// some annoying comment\n#include "some_header.h"\n           #include "some_other_header23.h"\n// another comment\n    some_other_var = 3.159;\n#include"some_header.h"'

print(desired_match)
# Out:    some_var_is_set = 42.42;
// some annoying comment
#include "some_header.h"
           #include "some_other_header23.h"
// another comment
    some_other_var = 3.159;
#include"some_header.h"

Removing everything NOT including #include would be nice, but is not required.

My current approach is:

import re

pattern = re.compile(
    (
        r'(\s*.*)#ifdef(\s+)TYPEA(\s*)'
        r'(?P<sources>(#include.*?)(?=((\s*)#else|(\s*)#endif)))'
    ), re.DOTALL
)
match = re.match(pattern, source_content)

This works fine as long as there are no comments or other statements within the enclosing #ifdef/#else/#endif. But as soon as there are comments or other statements, no match is returned.
The #include in the pattern is required, since there will be other blocks starting with #ifdef TYPEA which will not contain any include statements, but only one TYPEA-block with includes.

Thanks in advance!

CodePudding user response:

You can use the following regex with re.search and the re.DOTALL flag (note that re.match only finds a match at the start of the string, so re.search is the better fit here):

#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))

If you need multiple matches, you can plug this regex into a re.findall.
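
As a minimal sketch of how this could be wired up: the pattern (with \s+ after #ifdef) is compiled with re.DOTALL and run against a shortened stand-in for the question's source_content. The sample text and the variable names block and blocks are assumptions made for illustration.

```python
import re

# Shortened stand-in for the question's source_content (assumed sample).
source_content = '\n'.join([
    'int x = 1;',
    '#ifdef TYPEA',
    '// some annoying comment',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
    '',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

# re.DOTALL lets "." cross line breaks, so (.*?) can span several lines.
pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)

# re.match only tries position 0, where the text does not start with #ifdef.
assert pattern.match(source_content) is None

# re.search scans the whole string for the first TYPEA block.
match = pattern.search(source_content)
block = match.group(1) if match else ''
print(block)

# re.findall returns the captured group of every TYPEA block.
blocks = pattern.findall(source_content)
```

The lazy (.*?) stops at the first #else or #endif, so the #else branch and the TYPEB block are never captured.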

Details:

  • #ifdef - a fixed string
  • \s+ - one or more whitespaces
  • TYPEA - a fixed string
  • \s* - zero or more whitespaces
  • (.*?) - Group 1: any zero or more chars as few as possible (also matches line break chars as re.DOTALL is used)
  • (?=\s*#(?:else|endif)) - a positive lookahead that matches a location that is immediately followed by zero or more whitespaces and then either #else or #endif.
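
Since the question mentions that only one TYPEA block contains includes and that dropping the non-#include lines would be nice, here is a hedged sketch of one way to post-process Group 1. The helper find_includes, the second pattern include_re, and the sample text are hypothetical, and the sketch assumes headers are written in double quotes (not angle brackets).

```python
import re

# Hypothetical sample: one TYPEA block without includes, one with.
source_content = '\n'.join([
    '#ifdef TYPEA',
    'some_var_is_set = 42.42;',
    '#endif',
    '#ifdef TYPEA',
    '// some annoying comment',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '    some_other_var = 3.159;',
    '#include"some_header.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

block_re = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)
# re.MULTILINE makes ^ match at the start of every line inside a block.
# Assumption: only quoted headers ("...") are of interest.
include_re = re.compile(r'^[ \t]*#include[ \t]*"([^"\n]+)"', re.MULTILINE)


def find_includes(text):
    """Return header names from the first TYPEA block that has any includes."""
    for block in block_re.findall(text):
        headers = include_re.findall(block)
        if headers:
            return headers
    return []


headers = find_includes(source_content)
print(headers)
```

Duplicates are kept in order of appearance; wrapping the result in dict.fromkeys(headers) would drop them while preserving order, if that is preferred.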

