Global matching using PCRE

Time:12-08

I am trying to match comma-separated key-value pairs using PCRE. The text looks like this:

"key1": "value1", "key2": "value2"

and the tested pattern is:

/(\s*)(,?)"([a-zA-Z0-9]+?)":(\s*)"([a-zA-Z0-9]+?)"/gm

The regex can be tested here:

It works on the test page, but this C code displays only the first match:

pcre *re;
pcre_extra *sd;
const char *error;
int rc, erroroffset, i;
int ovector[OVECCOUNT];
re = pcre_compile(pattern, 0, &error, &erroroffset, NULL);
sd = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options */
    &error);        /* set to NULL or points to a message */
rc = pcre_exec(   /* see below for details of pcre_exec() options */
    re, sd, json, 7, 0, 0, ovector, 30);
pcre_free_study(sd);
printf("Match succeeded at offset %d\n", ovector[0]);
for (i = 0; i < rc; i++) {
    char *substring_start = json + ovector[2*i];
    int substring_length = ovector[2*i+1] - ovector[2*i];
    printf("%2d: '%.*s'\n", i, substring_length, substring_start);
}

The result is

Match succeeded at offset 1
 0: '"key1": "value1"'
 1: ''
 2: ''
 3: 'key1'
 4: ' '
 5: 'value1'

but I need to have all matching groups, with

'key2'
'value2'

Answer:

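One execution of pcre_exec() captures one key-value pair. You have to call pcre_exec() repeatedly to get the remaining pairs. On each later iteration you need to start at a different location in the string. The offset of the first character after the match is available as ovector[1], relative to the subject pointer you passed in, so you can add it to a running offset. pcre_exec() returns a positive count when it finds a match and a negative value such as PCRE_ERROR_NOMATCH when it does not, so the loop continues while the result is greater than zero. If you pass json + offset as the subject, remember to reduce the length argument by offset as well.

size_t offset = 0;

while ((rc = pcre_exec(…, json + offset, …)) > 0)
{
    …
    offset += ovector[1];
}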
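The loop above can be sketched end to end. To keep the example free of external libraries, this version swaps PCRE for POSIX regcomp()/regexec() from <regex.h>, so it is a stand-in rather than the asker's actual API. Here regexec() plays the role of pcre_exec(), and m[0].rm_eo plays the role of ovector[1]. The helper name count_pairs and the simplified pattern, which skips separators instead of capturing them, are assumptions made for illustration.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: prints every "key": "value" pair found in text and
   returns the number of pairs, or -1 if the pattern fails to compile. */
int count_pairs(const char *text)
{
    /* POSIX ERE stand-in for the PCRE pattern. The search is unanchored,
       so commas and whitespace between pairs are simply skipped. */
    const char *pattern =
        "\"([a-zA-Z0-9]+)\":[[:space:]]*\"([a-zA-Z0-9]+)\"";
    regex_t re;
    regmatch_t m[3];      /* m[0] = whole match, m[1] = key, m[2] = value */
    size_t offset = 0;    /* where the next search starts */
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* One call finds at most one match, so search again from just past
       the previous match until nothing more is found. */
    while (regexec(&re, text + offset, 3, m, 0) == 0) {
        const char *base = text + offset;  /* m[] offsets are relative to base */

        printf("%.*s = %.*s\n",
               (int)(m[1].rm_eo - m[1].rm_so), base + m[1].rm_so,
               (int)(m[2].rm_eo - m[2].rm_so), base + m[2].rm_so);
        count++;

        /* Advance past the match just reported (the ovector[1] step). */
        offset += (size_t)m[0].rm_eo;
    }

    regfree(&re);
    return count;
}
```

Called on the sample text from the question, count_pairs() reports both pairs and returns 2. The pattern cannot match an empty string, so the offset always advances. A pattern that can match empty would need an extra guard against looping forever.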

You could also investigate pcre_dfa_exec().

I wonder why you capture the (optional) leading white space and comma as separate items — is that information beneficial? It depends on your requirements, of course.
