SELECT DISTINCT doesn't appear to work with big query-CodePudding

I am filtering to a second created table that have duplicates removed. However I'm finding that DISTINCT seems not be working, and I end up with rows with identical ids. I want to only select one unique ID and throw any remaining ones away, but this is not what is happening. In other-words I do not care about the other column names.

def de_dupe_affiliates(read_table, write_table):
    query = """
    CREATE OR REPLACE TABLE `{write_table}` AS
    SELECT DISTINCT ID, BRAND, TITLE, SHORT_TITLE, PRICE, FROM `{read_table}` 
    """.format(read_table=read_table,write_table=write_table)
    response = client.query(query).result()

I also tried

SELECT DISTINCT(ID), BRAND

But this did the same. Is it possible to do this with a DISTINCT on one column?

CodePudding user response：

Your select clause

SELECT DISTINCT ID, BRAND, TITLE, SHORT_TITLE, PRICE FROM `{read_table}`

is equivalent to

SELECT ID, BRAND, TITLE, SHORT_TITLE, PRICE FROM `{read_table}` GROUP BY ID, BRAND, TITLE, SHORT_TITLE, PRICE

meaning any differences within these fields creates new rows in your result.

Your query works only if ID, BRAND, TITLE, SHORT_TITLE, PRICE fields are unique.

If not, you may use window functions like row_number() or rank() to select one row per id.

CodePudding user response：

Consider below approach

SELECT AS VALUE ANY_VALUE(t) FROM (
  SELECT ID, BRAND, TITLE, SHORT_TITLE, PRICE FROM read_table
) t 
GROUP BY ID