I'm writing a bash script to parse a bunch (a dozen or more) massive Terraform files that contain a large number of google_bigquery_dataset resources and their associated IAM access blocks. The script should take each dataset resource and copy it to another file, named for the dataset itself.
All of this works, except for extracting the name of the dataset from the resource's "dataset_id" field. That would be easy enough, if not for the fact that some of these dataset resources have authorized view blocks that also contain "dataset_id" values.
Here is an example of such a resource:
resource "google_bigquery_dataset" "project-bigquery-dataset-RESOURCE_NAME" {
  access {
    role = "WRITER"
    special_group = "projectWriters"
  }
  access {
    role = "READER"
    special_group = "projectReaders"
  }
  access {
    role = "WRITER"
    user_by_email = "[email protected]"
  }
  access {
    role = "OWNER"
    special_group = "projectOwners"
  }
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id = "table1"
    }
  }
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id = "table2"
    }
  }
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id = "table3"
    }
  }
  dataset_id = "THIS_IS_WHAT_I_WANT"
  default_partition_expiration_ms = "0"
  delete_contents_on_destroy = "false"
  labels = {
    application-name = "app-name"
  }
  location = "US"
  project = "project"
}
Before I realized that the authorized view blocks also had a dataset_id field, I was using this to try to grab the value I wanted, assuming startIndex and endIndex are the start and end line numbers of a complete dataset resource block as above:
fileName=$( sed -n ${startIndex},${endIndex}p $bigFile | grep "dataset_id" | cut -d\" -f2)
That works only when the resource has no authorized view blocks containing other dataset_id values.
I then tried to use a Negative Lookbehind:
fileName=$( sed -n ${startIndex},${endIndex}p $bigFile | grep '(?<!view {]n)dataset_id' | cut -f1 -d\"
That doesn't work. I'm not sure if it's because of the newline or because of the whitespace between the end of view { and the start of dataset_id = "DO_NOT_WANT".
I've tried variations on it, such as (?<!view\s{\s)\s*dataset_id, without success.
Is there any way to capture only the dataset_id that isn't in a view block?
A couple of notes:
- I can guarantee that view { will always precede the dataset_id in a block, without a line break.
- I cannot guarantee the order. It's possible the dataset_id I'm trying to capture could be present before the view blocks, after them, or even somewhere between them.
- Desired output for the above example would simply be THIS_IS_WHAT_I_WANT
Any help would be appreciated.
CodePudding user response:
If your grep supports the -P (PCRE) option, would you please try the following. It is tested with your shown sample.
grep -Poz '(?:^|\n)(?:(?!view).)*\n\s*dataset_id\s*=\s*"\K[^"]+' input_file
Output:
THIS_IS_WHAT_I_WANT
Assumption
- If view { precedes the dataset_id, the two words are on two consecutive lines.
Explanations
- As we need to match the pattern across lines, the -z option is given to grep so that it reads the input as NUL-separated records; a text file without NUL bytes is then handled as one multi-line record.
- The regex (?:^|\n)(?:(?!view).)*\n\s*dataset_id\s*=\s*"\K[^"]+ matches (at least) two lines, where the line before the one containing dataset_id does not contain the word view.
- (?:^|\n) anchors the start of the line, as the multiline option (?m) does not work due to the -z option.
- As a lookbehind assertion does not allow a variable-length match, we need to use (?:(?!view).)* as an alternative to (?<!view.*).
- The following \n\s*dataset_id makes sure at least one newline exists between view and dataset_id. Otherwise the regex would match a single line which just contains dataset_id, causing over-detection.
- \K discards the previously matched substring to exclude it from the output.
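As a minimal sketch of how the command above could be used inside a script, the following builds a small sample file and captures the value in a shell variable. The temporary file and the shortened resource are assumptions for illustration only, and it presumes a GNU grep built with -P support. Because -z makes grep end each match with a NUL byte, tr is used to turn that into a newline before the value is stored.

```shell
#!/usr/bin/env bash
# Hypothetical, shortened sample resource written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
resource "google_bigquery_dataset" "example" {
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }
  dataset_id = "THIS_IS_WHAT_I_WANT"
  location   = "US"
}
EOF

# -z reads the file as one record so \n can be matched in the pattern;
# -o prints only the part after \K, terminated by NUL, which tr converts.
fileName=$(grep -Poz '(?:^|\n)(?:(?!view).)*\n\s*dataset_id\s*=\s*"\K[^"]+' "$sample" | tr '\0' '\n')
echo "$fileName"

rm -f "$sample"
```

If a resource could contain more than one top-level match, the variable would hold several lines, so a real script may want to check for that case.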
CodePudding user response:
With your shown samples only, please try the following awk code. Written and tested in GNU awk.
awk -v RS= -v FS="\n" '
/^[[:space:]]+dataset_id[[:space:]]+/{
  split($1,arr,"\"")
  print arr[2]
}
' Input_file
Explanation: A simple explanation of the complete code:
- Setting RS (record separator) to the empty string puts awk in paragraph mode, where records are separated by blank lines.
- Then setting FS (field separator) to a newline.
- Then in the main block, checking whether the record starts with 1 or more whitespace characters followed by dataset_id followed by 1 or more whitespace characters again; if this condition is TRUE then:
- Using the split function of awk to split $1 (the first field) into an array named arr with " as the delimiter. This creates an array named arr with indexes 1, 2, 3, 4 and so on, depending on how many elements the split produces.
- Then printing arr's 2nd element, which is the output required by the OP.
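The paragraph-mode approach depends on the top-level dataset_id line being the first line after a blank line. A small sketch of that situation follows; the sample file, its blank-line layout, and the shortened resource are assumptions for illustration and may not match every input file.

```shell
#!/usr/bin/env bash
# Hypothetical sample: a blank line separates the access block from the
# top-level attributes, so awk sees two records in paragraph mode.
sample=$(mktemp)
cat > "$sample" <<'EOF'
resource "google_bigquery_dataset" "example" {
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }

  dataset_id = "THIS_IS_WHAT_I_WANT"
  location   = "US"
}
EOF

# RS= selects paragraph mode; ^ anchors at the start of each record, so only
# a record whose first line is the dataset_id assignment matches.
fileName=$(awk -v RS= -v FS="\n" '
/^[[:space:]]+dataset_id[[:space:]]+/ {
  split($1, arr, "\"")
  print arr[2]
}' "$sample")
echo "$fileName"

rm -f "$sample"
```

If the input has no blank lines between blocks, the whole resource becomes a single record starting with resource, and this condition does not match.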
CodePudding user response:
The nearby function from rquery (https://github.com/fuyuncat/rquery/releases) can retrieve the data of previous lines.
The following command line runs rq twice: the first run retrieves the current line and the previous line's data, and the second run then filters the output.
[ rquery]$ ./rq -q "p d/=/ | s trim(@1),trim(trim(@2),'\"'),nearby(trim(@1);@line;-1;'NULL')" samples/terra.txt | ./rq -q "p d/\t/r | s @2 | f @1='dataset_id' and @3!='view {'"
THIS_IS_WHAT_I_WANT
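For readers without rquery installed, the same previous-line check can be sketched in plain awk. This swaps in a different tool and is not rquery itself; the sample file and the variable names prev and parts are assumptions for illustration. It relies on the guarantee from the question that view { sits on the line directly before the unwanted dataset_id.

```shell
#!/usr/bin/env bash
# Hypothetical, shortened sample resource with no blank lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
resource "google_bigquery_dataset" "example" {
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }
  dataset_id = "THIS_IS_WHAT_I_WANT"
  location   = "US"
}
EOF

fileName=$(awk '
  # Print the quoted value when this line assigns dataset_id and the
  # previous line did not open a view block.
  /^[[:space:]]*dataset_id[[:space:]]*=/ && prev !~ /view[[:space:]]*[{]/ {
    split($0, parts, "\"")
    print parts[2]
  }
  # Remember the current line for the next iteration.
  { prev = $0 }
' "$sample")
echo "$fileName"

rm -f "$sample"
```

Unlike the paragraph-mode approach, this does not depend on blank lines, but it would miss the case where other attributes appear between view { and its dataset_id.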