I have a table in BigQuery that I want to update on a daily basis whenever new data come in to another table. Now, I also need to do some casting of variable that exist in a nested column in the original dataframe. Hence, I do the following thing:
config {
type: "incremental",
bigquery: {
partitionBy: "DATE(ts)",
clusterBy: ["IN","SN", "CB","B"]
}
}
SELECT distinct
timestamp as ts,
storeName,
DATE(timestamp) as Date,
itemId,
runId,
source,
storeUuid,
CASE WHEN (P.value IS NOT NULL AND SAFE_CAST(REPLACE(P.value, ",", ".") AS FLOAT64) IS NOT NULL)
THEN CAST(REPLACE(P.value, ",", ".") AS FLOAT64)/100
ELSE NULL END as P,
CASE WHEN (BUP.value IS NOT NULL AND SAFE_CAST(REPLACE(BUP.value, ",", ".") AS FLOAT64) IS NOT NULL)
THEN CAST(REPLACE(BUP.value, ",", ".") AS FLOAT64)/100
ELSE NULL END as BUP,
CASE WHEN (NP.value IS NOT NULL AND SAFE_CAST(REPLACE(NP.value, ",", ".") AS FLOAT64) IS NOT NULL)
THEN CAST(REGEXP_REPLACE(NP.value, ",", ".") AS FLOAT64)/100
ELSE NULL END as NP,
CASE WHEN (Quantity.value IS NOT NULL AND SAFE_CAST(REPLACE(Quantity.value, ",", ".") AS INT64) IS NOT NULL)
THEN CAST(REPLACE(Quantity.value, ",", ".") AS INT64)
ELSE NULL END as Quantity,
CASE WHEN (CQ.value IS NOT NULL AND SAFE_CAST(REPLACE(CQ.value, ",", ".") AS INT64) IS NOT NULL)
THEN CAST(REGEXP_REPLACE(CQ.value, ",", ".") AS INT64)
ELSE NULL END as CQ,
CAST(AverageSalesOverAWeek.value AS FLOAT64) AS AverageSalesOverAWeek,
CAST(CentralMainBarcode.value AS STRING) AS CentralMainBarcode,
CAST(MainBarcode.value AS STRING) AS MainBarcode,
CASE WHEN CAST(MB.value AS STRING) IS NULL
THEN CAST(CMB.value AS STRING)
ELSE CAST(MB.value AS STRING) END as B,
itemName,
itemGroup.value as itemGroup,
CAST(PT.value as String) as PT,
department.value as Department,
UC.value as UC,
KP.value as IKP,
NOCD.value as NOCD,
NPCEPEON.value as NPCEPEON ,
CCP.value as CCP,
LDWA.value as LDWA,
IOOS.value as IOOS,
CASE
WHEN ID.value = 'true' THEN 1
WHEN ID.value = 'false' THEN 0
ELSE NULL
END as ID,
CASE
WHEN DT.value = 'true' THEN 1
WHEN DT.value = 'false' THEN 0
ELSE NULL
END as DT,
CASE
WHEN CID.value = 'true' THEN 1
WHEN CID.value = 'false' THEN 0
ELSE NULL
END as CID,
CASE
WHEN HP.value = 'true' THEN 1
WHEN HP.value = 'false' THEN 0
ELSE NULL
END as HP,
CAST(presentation.value AS STRING) AS presentation,
CASE WHEN presentation.value = 'promo' THEN 1 ELSE 0 END as PF,
CAST(InStock.value AS INT64) AS InStock,
CAST(Brand.value AS STRING) AS Brand,
CAST(CB.value AS STRING) AS CentralBrand,
CPD.value as CPD,
CPPD.value as CPPD,
..........
5 more like these
FROM
`table`
LEFT OUTER JOIN UNNEST(itemProperties) AS ReceiptProductDescription ON ReceiptProductDescription.name = 'ReceiptProductDescription'
LEFT OUTER JOIN UNNEST(itemProperties) AS GD ON GD.name = 'GD'
LEFT OUTER JOIN UNNEST(itemProperties) AS MB ON MD.name = 'MB'
LEFT OUTER JOIN UNNEST(itemProperties) AS B ON B.name = 'B'
LEFT OUTER JOIN UNNEST(itemProperties) AS CB ON CB.name = 'CB'
LEFT OUTER JOIN UNNEST(itemProperties) AS IG ON IG.name = 'IG'
LEFT OUTER JOIN UNNEST(itemProperties) AS PT ON PT.name = 'PT'
LEFT OUTER JOIN UNNEST(itemProperties) AS CMB ON CMB .name = 'CMB '
LEFT OUTER JOIN UNNEST(itemProperties) AS CPD ON CPD.name = 'CPD '
LEFT OUTER JOIN UNNEST(itemProperties) AS CPPD ON CPPD .name = 'CPPD '
LEFT OUTER JOIN UNNEST(itemProperties) AS IS ON IS.name = 'IS'
LEFT OUTER JOIN UNNEST(itemProperties) AS Quantity ON Quantity.name = 'Quantity'
LEFT OUTER JOIN UNNEST(itemProperties) AS CCQ ON CCQ.name = 'CCQ '
LEFT OUTER JOIN UNNEST(itemProperties) AS ISD ON ISD.name = 'ISD '
LEFT OUTER JOIN UNNEST(itemProperties) AS DeliveredToday ON DeliveredToday.name = 'DeliveredToday'
LEFT OUTER JOIN UNNEST(itemProperties) AS IOOS ON IOOS.name = 'IOOS '
LEFT OUTER JOIN UNNEST(itemProperties) AS LDWA ON LDWA.name = 'LDWA '
LEFT OUTER JOIN UNNEST(itemProperties) AS NPCEPEON NPCEPEON.name = 'NPCEPEON'
LEFT OUTER JOIN UNNEST(itemProperties) AS NP ON NP.name = 'NP'
LEFT OUTER JOIN UNNEST(itemProperties) AS BUP ON BUP.name = 'BUP'
LEFT OUTER JOIN UNNEST(itemProperties) AS HP ON HP.name = 'HP'
LEFT OUTER JOIN UNNEST(itemProperties) AS CCP ON CCP.name = 'CCP '
LEFT OUTER JOIN UNNEST(itemProperties) AS NPD ON NPD.name = 'NPD '
LEFT OUTER JOIN UNNEST(itemProperties) AS UnitCode ON UnitCode.name = 'UnitCode'
LEFT OUTER JOIN UNNEST(itemProperties) AS P ON P.name = 'P'
.........
5 more like that
${when(incremental(), `where timestamp > (select max(timestamp) from ${self()})`)}
The issue I am facing has nothing to do with the query, because it runs well in BQ. What the issue is though, is that I cannot run the incremental update from Dataform.
I get the following error message:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 162% of limit. Top memory consumer(s): JOIN operations: 99% other/unattributed: 1%
I have seen solution to be applied if groupby not for joins.
How can I solve this?
I should point out (as will be evident to most that read this) that I seldom work with SQL. If another way to efficiently unnest data than left outer joins are avaible, I am all ears.
CodePudding user response:
Some operations must be performed on a single machine[1]. When your data overflows the slot's memory, you experience the error that appears. My approach on what you need is to change the slot type to a higher memory machine[2], or switch to flat-rate pricing to have reservation resources[3].
Otherwise, you will have to follow the query best practices to avoid single slot operations on too much data. One thing that usually helps is to move to STANDARD SQL if you haven't done that yet. Additionally, I’m attaching some information on how the memory in BigQuery works during an execution[4].
[2] https://cloud.google.com/bigquery/docs/slots
[3] https://cloud.google.com/bigquery/docs/reservations-intro
[4] https://cloud.google.com/blog/products/bigquery/in-memory-query-execution-in-google-bigquery