I was given assistance with a dedupe question I had, to remove duplicate rows in Databricks with PySpark.
The dataframe that I execute the code on is as follows:
Id | createdon | SinkModifiedOn | title | caltype |
---|---|---|---|---|
e64650d3-94fb-ec11-82e6-0022481b14fc | 04/07/2022 | 16/01/2023 14:37:14 | Partner | Enterprise |
8b97aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
bd97aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise |
42b5f518-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise |
8ab2abf9-6169-ec11-8943-000d3a870a5b | 30/12/2021 | 16/01/2023 14:37:14 | Partner | Basic |
2d51010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
40e1feb7-efa5-e811-a96b-00224800cdc2 | 22/08/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
d3320875-1e81-e811-a95c-00224800cc97 | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise |
1ea0055a-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
81cf613f-1e81-e811-a95c-00224800cc97 | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise |
fb50010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
af4a3b88-1d81-e811-a95c-00224800c4f1 | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
0551010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
8c20f753-5a78-e811-a95b-00224800c3e8 | 25/06/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
e597aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
0adad377-b039-e911-a999-00224800c9f4 | 26/02/2019 | 16/01/2023 14:37:14 | Partner | Enterprise |
3315cd94-1d81-e811-a95c-00224800c4f1 | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
151ca1a8-a586-e811-a95c-00224800c4f1 | 13/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
55ce798d-2081-e811-a95c-00224800cc97 | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise |
55ce798d-2081-e811-a95c-00224800cc97 | 06/07/2018 | 16/01/2023 13:37:09 | Partner | Enterprise |
6fa9fe08-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
8197aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
4151010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
74cf613f-1e81-e811-a95c-00224800cc97 | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise |
The PySpark code is as follows:
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df2 = partdf.withColumn(
    "rn",
    row_number().over(Window.partitionBy("Id", "CreatedOn").orderBy("SinkModifiedOn"))
)
df3 = df2.filter("rn = 1").drop("rn")
```
After executing the code, absolutely no rows are returned.
Can someone take a look at the code and let me know why no rows are returned after deduping? My guess is that it has something to do with the SinkModifiedOn field.
CodePudding user response:
Use `dropDuplicates` to remove duplicate rows:

```python
df = partdf.dropDuplicates(subset=['Id', 'createdon', 'SinkModifiedOn', 'caltype'])
```
Or, in this specific case, simply:

```python
partdf.drop_duplicates()
```