New to SQL and trying to understand nested queries and how to use them. I have a substring, case statement, and trim statement that I'm trying to put together but am unsure of how. The substring has to be done first, then the case statement, then the trim. This is what I have at the moment but unsure of how to get it working. The code is random names/tables as an example
SELECT dtXYZ.*
FROM
(
SELECT dt,
SUBSTRING_INDEX(SUBSTRING_INDEX(dt, ..................... ) as lioness,
SUBSTRING_INDEX(SUBSTRING_INDEX(dt, .....................) as tiger,
SUBSTRING_INDEX(dt, .................) as bear
FROM Animaltab
) dtXYZ
SELECT
CASE WHEN length(bear) = 4 THEN bear
ELSE concat('0', bear)
END AS bear_corr,
CASE WHEN length(lion) = 7 THEN lioness
ELSE concat('0', lioness)
END AS lion_corr
trim(lion_corr) || '_' || trim(tiger) || '_' || trim(bear_corr) as new_imp_animal
CodePudding user response:
Spark supports CTE https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html
even with databrics this will work see Common Table Expressions (CTEs) in Databricks and Spark
ANd you can nest them like this
WITH dtXYZ(dt,lioness,tiger,bear() AS ( SELECT dt,
SUBSTRING_INDEX(SUBSTRING_INDEX(dt, ..................... ) as lioness,
SUBSTRING_INDEX(SUBSTRING_INDEX(dt, .....................) as tiger,
SUBSTRING_INDEX(dt, .................) as bear
FROM Animaltab),
dtcorrected (dt,bear_corr,lion_corr,tiger) as (
SELECT
dt,
CASE WHEN length(bear) = 4 THEN bear
ELSE concat('0', bear)
END AS bear_corr,
CASE WHEN length(lion) = 7 THEN lioness
ELSE concat('0', lioness)
END AS lion_corr
,tiger
FROM dtXYZ)
SELECT
dt,
trim(lion_corr) || '_' || trim(tiger) || '_' || trim(bear_corr) as new_imp_animal FROM dtcorrected
CodePudding user response:
Order of operations can be tricky with SQL if you're used to ordering things in a procedure. Like nbk commented, CTEs or Common Table Expressions are your best bet. CTEs are defined by the 'with' keyword and are very similar to nested subqueries (you could write the same query nested if you wanted) but are better suited to this operation where the nesting structure of the code doesn't mimic the nesting of the data. I always use CTEs if I'm joining tables that each need independent grouping or filtering. The SQL in the parenthesis essentially creates a view, and the outside SQL is a second higher-order select statement to create a result set. If I'm working with hierarchical data (parent, child, grandchild), I'll go with the nesting in the query to follow that path, but usually, the CTE is easier to organize your ideas. Here's how that would work:
with dtXYZ as
(
SELECT dt,
SUBSTRING_INDEX(SUBSTRING_INDEX(dt, ..................... ) as lioness,
SUBSTRING_INDEX(SUBSTRING_INDEX(dt, .....................) as tiger,
SUBSTRING_INDEX(dt, .................) as bear
FROM Animaltab
)
SELECT
CASE WHEN length(bear) = 4 THEN bear
ELSE concat('0', bear)
END AS bear_corr,
CASE WHEN length(lion) = 7 THEN lioness
ELSE concat('0', lioness)
END AS lion_corr,
trim(lion_corr) || '_' || trim(tiger) || '_' || trim(bear_corr) as new_imp_animal
from
dtXYZ
And in terms of 'order of operations,' case statements and functions in a select can be referenced by other parts of the select statement as inputs. Things can get hairy when you use 'if' ideas that resolve to illogical or error-causing conditions. Still, otherwise, I've had no issues with having many parts of a select refer to each other. It's an excellent way to test out nesting functions.