I have a Spark SQL query where I have to search for multiple identifiers:
SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
Now I get hundreds of results for each of these matches, but I am only interested in the first match for each identifier, i.e. one row where identifier == 'abc', one where identifier == 'cde', and so on.
What is the best way to reduce my result to only the first row for each match?
CodePudding user response:
The best approach certainly depends a bit on your data and also on what you mean by "first". Is that any random row that happens to be returned first, or the first row by some particular sort order?
A general, flexible approach is to use window functions: row_number() lets you easily filter for the first row in each window.
SELECT * FROM (
  SELECT *, row_number() OVER (PARTITION BY identifier ORDER BY ???) AS row_num
  FROM my_table
  WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
) tmp
WHERE row_num = 1
That said, aggregations like first or max_by are often more efficient, but they quickly become inconvenient when dealing with lots of columns.
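For example, a minimal sketch of the max_by variant, assuming a hypothetical ts column that defines what "first" means (min_by works analogously if you want the smallest ts):

SELECT identifier,
       max_by(col1, ts) AS col1,
       max_by(col2, ts) AS col2,
       ...
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier

Note that each max_by is evaluated per column, so if there are ties in ts, the values are not guaranteed to all come from the same row.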
CodePudding user response:
You can use the first() aggregation function (after grouping by identifier) to get only the first row in each group. However, I don't think you'll be able to SELECT * with this approach. Instead, you can list every individual column you want to get:
SELECT identifier, first(col1), first(col2), first(col3), ...
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier
Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
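A sketch of that approach; each subquery is wrapped so the LIMIT applies per identifier rather than to the whole union:

SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'abc' LIMIT 1) t1
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'cde' LIMIT 1) t2
-- ... one UNION ALL branch per remaining identifier

This scales poorly as the list of identifiers grows, but can be fine for a short, fixed list.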
With the DataFrame API, you can run your original query and then call .dropDuplicates(["identifier"]) on the result to keep only a single row for each identifier value.
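A minimal PySpark sketch of that, assuming a SparkSession named spark is already set up:

# run the original query, then keep one row per identifier value
df = spark.sql(
    "SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')"
)
# dropDuplicates keeps an arbitrary row per identifier, not a guaranteed "first"
result = df.dropDuplicates(["identifier"])

Keep in mind that dropDuplicates gives no guarantee about which row survives; if "first" means a specific ordering, the row_number() approach above is the safer choice.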