Reduce results to first match for each pattern with spark sql


I have a Spark SQL query where I have to search for multiple identifiers:

SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')

This returns hundreds of rows for each of these matches, but I am only interested in the first match for each identifier, i.e. one row where identifier == 'abc', one where identifier == 'cde', and so on.

What is the best way to reduce my result to only the first row for each match?

CodePudding user response:

The best approach depends on your data and on what you mean by first. Is that any arbitrary row that happens to be returned first? Or the first row by some particular sort order?

A general, flexible approach is to use window functions: row_number() lets you keep only the first row of each partition.

SELECT * FROM (
  SELECT *,
         -- replace ??? with the column(s) that define what "first" means
         row_number() OVER (PARTITION BY identifier ORDER BY ???) AS row_num
  FROM my_table
  WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')) tmp
WHERE
  row_num = 1

That said, aggregations like first or max_by are often more efficient, but they quickly become inconvenient when dealing with lots of columns.
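For example, here is a minimal PySpark sketch of the max_by route (available in Spark 3.0+). The event_time column and col1/col2 are placeholders for whatever defines "first" in your schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# max_by(x, y) returns the value of x for the row where y is largest;
# use min_by instead if "first" means the smallest event_time.
# One call per output column reduces each group to a single row.
df = spark.sql("""
    SELECT identifier,
           max_by(col1, event_time) AS col1,
           max_by(col2, event_time) AS col2
    FROM my_table
    WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
    GROUP BY identifier
""")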

CodePudding user response:

You can use the first() aggregation function (after grouping by identifier) to keep only one row per group. You won't be able to SELECT * with this approach, though; instead, list every individual column you want to get:

SELECT identifier, first(col1), first(col2), first(col3), ...
    FROM my_table
    WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
    GROUP BY identifier

Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
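A minimal sketch of that idea with the DataFrame API, assuming the identifier list is known up front. Note that which row survives each limit(1) is arbitrary unless you sort first:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ids = ["abc", "cde", "efg", "ghi"]

table = spark.table("my_table")
# One single-row query per identifier, then concatenate the pieces.
parts = [table.filter(table.identifier == i).limit(1) for i in ids]
result = reduce(lambda a, b: a.union(b), parts)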

With the DataFrame API, you can use your original query and then use .dropDuplicates(["identifier"]) on the result to only keep a single row for each identifier value.
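As a sketch, assuming the table name from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("my_table").filter(
    "identifier IN ('abc', 'cde', 'efg', 'ghi')"
)
# Keeps one arbitrary row per identifier value.
result = df.dropDuplicates(["identifier"])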
