Home > Net >  Databricks Deduplicaton PySpark Code Removing All Rows in Table
Databricks Deduplicaton PySpark Code Removing All Rows in Table

Time:01-27

I was given assistance with dedupe question I had to remove duplicate rows in databricks with Pyspark.

The dataframe that I execute the code on is as follows:

Id createdon SinkModifiedOn title caltype
e64650d3-94fb-ec11-82e6-0022481b14fc 04/07/2022 16/01/2023 14:37:14 Partner Enterprise
8b97aa35-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
bd97aa35-1d81-e811-a95c-00224800c9ff 06/07/2018 17/01/2023 18:08:12 Partner Enterprise
42b5f518-1d81-e811-a95c-00224800c9ff 06/07/2018 17/01/2023 18:08:12 Partner Enterprise
8ab2abf9-6169-ec11-8943-000d3a870a5b 30/12/2021 16/01/2023 14:37:14 Partner Basic
2d51010f-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
40e1feb7-efa5-e811-a96b-00224800cdc2 22/08/2018 16/01/2023 14:37:14 Partner Enterprise
d3320875-1e81-e811-a95c-00224800cc97 06/07/2018 17/01/2023 18:08:12 Partner Enterprise
1ea0055a-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
81cf613f-1e81-e811-a95c-00224800cc97 06/07/2018 17/01/2023 18:08:12 Partner Enterprise
fb50010f-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
af4a3b88-1d81-e811-a95c-00224800c4f1 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
0551010f-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
8c20f753-5a78-e811-a95b-00224800c3e8 25/06/2018 16/01/2023 14:37:14 Partner Enterprise
e597aa35-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
0adad377-b039-e911-a999-00224800c9f4 26/02/2019 16/01/2023 14:37:14 Partner Enterprise
3315cd94-1d81-e811-a95c-00224800c4f1 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
151ca1a8-a586-e811-a95c-00224800c4f1 13/07/2018 16/01/2023 14:37:14 Partner Enterprise
55ce798d-2081-e811-a95c-00224800cc97 06/07/2018 17/01/2023 18:08:12 Partner Enterprise
55ce798d-2081-e811-a95c-00224800cc97 06/07/2018 16/01/2023 13:37:09 Partner Enterprise
6fa9fe08-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
8197aa35-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
4151010f-1d81-e811-a95c-00224800c9ff 06/07/2018 16/01/2023 14:37:14 Partner Enterprise
74cf613f-1e81-e811-a95c-00224800cc97 06/07/2018 16/01/2023 14:37:14 Partner Enterprise

The Pyspark code is as follows:

The PySpark is as follows:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("Id","CreatedOn").orderBy("SinkModifiedOn")))
df3 = df2.filter("rn = 1").drop("rn")

After executing the code absolutely no rows are returned.

Can someone take a look at the code at let me know why after deduping no rows are returned?

My guess is that it has something to do with the SinkModified on field

CodePudding user response:

Use dropDuplicates to remove duplicate rows :

df = partdf.dropDuplicates(subset=['Id', 'createdon', 'SinkModifiedOn', 'caltype'])

Or

in this specific case, just simply:

partdf.drop_duplicates()
  • Related