So I feel that I've drained all my other options for this problem so I thought I'd ask here.
I'm creating a CI/CD pipeline for Google Big Query (GBQ) where some of the data needs cleaning and to do that I've tried using regex. The problem here is that the regex I have works fine with all "flavours" except Golang and since Golang doesn't support negative lookup I'm kind of stuck.
This is the regex expression that works for the flavours that support negative lookup:
([,](?![^{\[]*[\}\]]))
This is a sample row of the data I'm trying to cleaenup:
"10016"=>"To be reviewed", "10017"=>"Customer", "10018"=>"[{\"otype\": \"user\", \"oid\": 82}, {\"otype\": \"user\", \"oid\": 52}]", "10020"=>"[{\"otype\": \"article\", \"oid\": 3000}]", "10023"=>"[{\"otype\": \"table\", \"oid\": 56740}]", "10032"=>"Primary", "10033"=>"Yes"
The data row is a string from a column in GBQ and as you can see it has array in it as well as other pointyarrow thingies that link an ID to a value.
In the screenshot you can see that the regex with the negative lookup splits at the commas everywhere except the commas that are inside square and swirly brackets, and that is how it's supposed to do.
Example row with negative lookup regex
My question is, since GBQ doesn't support negative lookup, is there a regex guru here that can help me create a regex expression that doesn't use negative lookup to achieve the same results?
The end result would be, when I use this regex function in GBQ
REGEXP_EXTRACT_ALL(article_custom_field_values, r'regex_expression')
it should split the string per object while keeping the arrays intact, as the screenshot posted.
The desired output should look like this:
"10016"=>"To be reviewed"
"10017"=>"Customer"
"10018"=>"[{\"otype\": \"user\", \"oid\": 82},{\"otype\": \"user\", \"oid\": 52}]"
"10020"=>"[{\"otype\": \"article\", \"oid\": 3000}]"
"10023"=>"[{\"otype\": \"table\", \"oid\": 56740}]"
"10032"=>"Primary"
"10033"=>"Yes"
Link to negative lookup regex at regex101.com: https://regex101.com/r/LRzHuS/1
The main reason I'm trying to use regex is because the data looks totally different on each row, here are some samples on how the data looks like:
"10017"=>"Customer", "10032"=>"Secondary"
"10017"=>"Customer", "10023"=>"[]"
"10017"=>"Retail Range", "10032"=>"Primary"
"10016"=>NULL, "10017"=>"Sales", "10023"=>"[{\"otype\": \"table\", \"oid\": 1027}]", "10032"=>"Secondary"
"10023"=>"[{\"otype\": \"table\", \"oid\": 57692}, {\"otype\": \"table\", \"oid\": 57526}, {\"otype\": \"table\", \"oid\": 57715}, {\"otype\": \"table\", \"oid\": 57717}, {\"otype\": \"table\", \"oid\": 57728}, {\"otype\": \"table\", \"oid\": 57726}, {\"otype\": \"table\", \"oid\": 57720}, {\"otype\": \"table\", \"oid\": 57687}, {\"otype\": \"table\", \"oid\": 57724}, {\"otype\": \"table\", \"oid\": 57584}, {\"otype\": \"table\", \"oid\": 57773}, {\"otype\": \"table\", \"oid\": 743}]"
"10023"=>"[]"
"10014"=>"<p><a data-oid=\"3\" data-otype=\"data\" href=\"/data/3/\"></a> </p>", "10023"=>"[]"
Would really appreciate assistance as I've not been able to create a solid regex expression that achieves this without using negative lookup :(
CodePudding user response:
First of all, I'd recommend using a real JSON parser, as that is likely to be more reliable and more performant.
If you insist on using regex, then I'd recommend matching the actual content instead of matching the separating commas. Please note this approach demands a code change: you'll need to capture the matches rather than split the input on the matches.
"\d "\s*=>\s*"(?:[^"\\]|\\.)*"
Expanded to also cover NULL values:
"\d "\s*=>\s*(?:"(?:[^"\\]|\\.)*"|NULL)
Explanation:
"\d "
- digits, surrounded by quotes (please feel free to match more than just digits)\s*=>\s*
- the arrow, with optional whitespace"(?:[^"\\]|\\.)*"
- the value, surrounded by quotes; making sure\"
does not end the value
Fiddle: https://regex101.com/r/EYtDJl/4