Reduce results to first match for each pattern with spark sql


I have a Spark SQL query where I have to search for multiple identifiers:

SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')

This returns hundreds of rows for each of these matches, but I am only interested in the first match for each identifier, i.e. one row where identifier == 'abc', one where identifier == 'cde', and so on.

What is the best way to reduce my result to only the first row for each match?

CodePudding user response:

The best approach depends on your data and on what you mean by first. Is that any arbitrary row that happens to be returned first? Or the first row by some particular sort order?

A general, flexible approach is to use window functions: row_number() lets you keep only the first row of each partition.

SELECT * FROM (
  SELECT *,
         -- replace ??? with the column(s) that define what "first" means
         row_number() OVER (PARTITION BY identifier ORDER BY ???) AS row_num
  FROM my_table
  WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')) tmp
WHERE
  row_num = 1

That said, aggregations like first or max_by are often more efficient, but they quickly become inconvenient when dealing with lots of columns.
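For example, here is a minimal PySpark sketch of the max_by route (available in Spark 3.0+). The event_time column and col1/col2 are placeholders for whatever defines "first" in your schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# max_by(x, y) returns the value of x for the row where y is largest;
# use min_by instead if "first" means the smallest event_time.
# One call per output column reduces each group to a single row.
df = spark.sql("""
    SELECT identifier,
           max_by(col1, event_time) AS col1,
           max_by(col2, event_time) AS col2
    FROM my_table
    WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
    GROUP BY identifier
""")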

CodePudding user response:

You can use the first() aggregation function (after grouping by identifier) to keep only one row per group. You won't be able to SELECT * with this approach, though; instead, list every individual column you want to get:

SELECT identifier, first(col1), first(col2), first(col3), ...
    FROM my_table
    WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
    GROUP BY identifier

Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
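A minimal sketch of that idea with the DataFrame API, assuming the identifier list is known up front. Note that which row survives each limit(1) is arbitrary unless you sort first:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ids = ["abc", "cde", "efg", "ghi"]

table = spark.table("my_table")
# One single-row query per identifier, then concatenate the pieces.
parts = [table.filter(table.identifier == i).limit(1) for i in ids]
result = reduce(lambda a, b: a.union(b), parts)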

With the DataFrame API, you can use your original query and then use .dropDuplicates(["identifier"]) on the result to only keep a single row for each identifier value.
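As a sketch, assuming the table name from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("my_table").filter(
    "identifier IN ('abc', 'cde', 'efg', 'ghi')"
)
# Keeps one arbitrary row per identifier value.
result = df.dropDuplicates(["identifier"])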
