My requirement is to split a json file in rows for each element within the json. I order to do that I've figure it out that the query to split by rows is something like this:
SELECT EXPLODE(SPLIT(regexp_replace(json_string, '}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {', '}\n{'), '\n')) AS json_exploded
Where as you can see the only sequence that allows me to split by rows is this one:
}, "A CODE OF 27 ALPHA CHARACTERS": {
the thing is that this string within the expresion is changing for every element within the json.
So I should find a pattern for this
}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {
in order to replace it by this:
}\n{
So then, Which should be the regex for something like this:
}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {
Can you help please?
CodePudding user response:
You haven't said which programming language you are using. Many languages support extract/replace by groups. The following regex contains three capturing groups
(.*\}, ")(\w*)(": \{.*)
$1 = SELECT EXPLODE(SPLIT(regexp_replace(json_string, '}, "
$2 = ChIJA7Vlp5iZpBIRo8rMYZedIDo
$3 = ": {', '}\n{'), '\n')) AS json_exploded
Which can be refered to with $1 $2 $3
or \1 \2 \3
depending on the programming language that you are using.
The expression $1\n$3
or \1\n\3
would therefore equal SELECT EXPLODE(SPLIT(regexp_replace(json_string, '}, "\n": {', '}\n{'), '\n')) AS json_exploded
CodePudding user response:
I'm connected to an EMR Cluster (spark under the hood) through a Thrift Server so that I can query json files from DataGrip.
SELECT regexp_replace('{{ "timestamp": "20220302_145927", "hexagon": "8839446009fffff", "city_code": "BCN"}, "ChIJ_7Vlp5iZpBI-o8rMYZedIDo": {"raw_data": {"business_status"}}'
, 'REGEX_PATTERN_HERE'
, '\n') AS string_fixed
So if I replace that chunk of the string:
}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {
by a '}\n{'
later I will be available to split by them and I will my rows.