I am filtering to a second created table that have duplicates removed. However I'm finding that DISTINCT seems not be working, and I end up with rows with identical ids
. I want to only select one unique ID
and throw any remaining ones away, but this is not what is happening. In other-words I do not care about the other column names.
def de_dupe_affiliates(read_table, write_table):
query = """
CREATE OR REPLACE TABLE `{write_table}` AS
SELECT DISTINCT ID, BRAND, TITLE, SHORT_TITLE, PRICE, FROM `{read_table}`
""".format(read_table=read_table,write_table=write_table)
response = client.query(query).result()
I also tried
SELECT DISTINCT(ID), BRAND
But this did the same. Is it possible to do this with a DISTINCT on one column?
CodePudding user response:
Your select clause
SELECT DISTINCT ID, BRAND, TITLE, SHORT_TITLE, PRICE FROM `{read_table}`
is equivalent to
SELECT ID, BRAND, TITLE, SHORT_TITLE, PRICE FROM `{read_table}` GROUP BY ID, BRAND, TITLE, SHORT_TITLE, PRICE
meaning any differences within these fields creates new rows in your result.
Your query works only if ID, BRAND, TITLE, SHORT_TITLE, PRICE fields are unique.
If not, you may use window functions like row_number()
or rank()
to select one row per id.
CodePudding user response:
Consider below approach
SELECT AS VALUE ANY_VALUE(t) FROM (
SELECT ID, BRAND, TITLE, SHORT_TITLE, PRICE FROM read_table
) t
GROUP BY ID