Redshift Regex_Replace : using pattern in replace string-CodePudding

My requirement is to split a json file in rows for each element within the json. I order to do that I've figure it out that the query to split by rows is something like this:

SELECT EXPLODE(SPLIT(regexp_replace(json_string, '}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {', '}\n{'), '\n')) AS json_exploded

Where as you can see the only sequence that allows me to split by rows is this one:

}, "A CODE OF 27 ALPHA CHARACTERS": {

the thing is that this string within the expresion is changing for every element within the json.

So I should find a pattern for this

}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {

in order to replace it by this:

}\n{

So then, Which should be the regex for something like this:

}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {

Can you help please?

CodePudding user response：

You haven't said which programming language you are using. Many languages support extract/replace by groups. The following regex contains three capturing groups

(.*\}, ")(\w*)(": \{.*)

$1 = SELECT EXPLODE(SPLIT(regexp_replace(json_string, '}, "
$2 = ChIJA7Vlp5iZpBIRo8rMYZedIDo
$3 = ": {', '}\n{'), '\n')) AS json_exploded

Which can be refered to with $1 $2 $3 or \1 \2 \3 depending on the programming language that you are using.
The expression $1\n$3 or \1\n\3 would therefore equal SELECT EXPLODE(SPLIT(regexp_replace(json_string, '}, "\n": {', '}\n{'), '\n')) AS json_exploded

CodePudding user response：

I'm connected to an EMR Cluster (spark under the hood) through a Thrift Server so that I can query json files from DataGrip.

SELECT regexp_replace('{{ "timestamp": "20220302_145927", "hexagon": "8839446009fffff", "city_code": "BCN"}, "ChIJ_7Vlp5iZpBI-o8rMYZedIDo": {"raw_data": {"business_status"}}'
                                    , 'REGEX_PATTERN_HERE'
                                    , '\n') AS string_fixed

So if I replace that chunk of the string:

}, "ChIJA7Vlp5iZpBIRo8rMYZedIDo": {

by a '}\n{'

later I will be available to split by them and I will my rows.