I am querying an S3 using Athena and I want to select the key until certain path level. Here are some key examples:
- project=proj1/ID=SS02769/input=client1/version=X3900/data/fold1/file1.csv
- project=proj1/ID=SS02770/input=client1/version=X4500/data/fold1/file2.csv
- project=proj1/ID=SS02775/input=client1/version=X9000/data/fold1/file3.csv
I want to query these rows and select all the string until 'data/' is there any regex expression to use or any sql expression? Actually I tried the next regex expression but does not work :
regex_match = fr'([^/]*[/]){{{depth}}}'
with depth=4 and with the next query :
f"SELECT REGEXP_EXTRACT(key,'{regex_match}') AS path FROM bucket_number_1)\
GROUP BY REGEXP_EXTRACT(key, '{regex_match}')"
but I dot empty dataframe
Any help is appreciated !
CodePudding user response:
You can use
regex_match = r'^((?:[^/]*/){5})'
Details:
^
- start of string((?:[^/]*/){5})
- Capturing group 1:(?:[^/]*/){5}
- five consecutive occurrences of[^/]*
- any zero or more chars other than a/
char/
- a/
char.