I am using Pyspark, and I have 4 data frames, each having the same schema. I want to count the distinct ids in them all.
But if I do:
combined_df1 = df1.select('id') \
    .union(df2.select('id')) \
    .union(df3.select('id')) \
    .union(df4.select('id'))
combined_df1.select('id').distinct().count()
I get a higher figure than if I do:
combined_df2 = df1 \
    .union(df2) \
    .union(df3) \
    .union(df4)
combined_df2.select('id').distinct().count()
About 10% more rows are returned, which is not an insignificant amount. What's going on here? Why does it matter whether I select the distinct ids before or after the union? It's the same list of ids either way, isn't it?
CodePudding user response:
It is difficult to be sure, but my guess is that the columns in your dataframes are not all in the same order. union combines dataframes positionally, so the id column of the first dataframe can end up "unioned" with a different column of the next dataframes. The distinct then "fails" because it deduplicates mismatched values. Note that your first snippet is immune to this, because it selects id before the union, so only id columns are ever combined; your second snippet unions entire rows positionally, which is why the two counts can differ.
The solution is to replace union with unionByName, which matches columns by name instead of by position.
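To see the positional pitfall without spinning up Spark, here is a minimal pure-Python sketch (the column names and rows are illustrative, not from your data); lists of dicts stand in for dataframes, and tuples built by position stand in for union:

```python
# Two "dataframes" with the same schema but columns in a different order.
rows_a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
rows_b = [{"name": "y", "id": 2}, {"name": "z", "id": 3}]

cols_a = ["id", "name"]
cols_b = ["name", "id"]  # same columns, different order

# union-style: columns matched by POSITION (like DataFrame.union)
positional = [tuple(r[c] for c in cols_a) for r in rows_a] + \
             [tuple(r[c] for c in cols_b) for r in rows_b]
# the "id" column now mixes real ids with names from rows_b
ids_positional = {t[0] for t in positional}

# unionByName-style: columns matched by NAME
by_name = [tuple(r[c] for c in cols_a) for r in rows_a + rows_b]
ids_by_name = {t[0] for t in by_name}

print(ids_positional)  # {1, 2, 'y', 'z'} -- four "distinct ids", two bogus
print(ids_by_name)     # {1, 2, 3} -- the true distinct ids
```

The positional version inflates the distinct count with values that were never ids, which is exactly the kind of discrepancy you are seeing.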
To answer your exact question:
Why does it matter whether I select distinct id before or after the union?
TL;DR: union does not deduplicate values.
Let's take a simple example with 2 dataframes:
df1:
id |
---|
2 |
2 |
3 |
df2:
id |
---|
1 |
1 |
3 |
df1.distinct() ==> [2,3]
df2.distinct() ==> [1,3]
df1.distinct().union(df2.distinct()) ==> [2,3].union([1,3]) ==> [2,3,1,3] (4 elements)
df1.union(df2) ==> [2,2,3,1,1,3]
df1.union(df2).distinct() ==> [1,2,3] (only 3 elements)
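The arithmetic above can be checked with plain Python, using sets to stand in for distinct:

```python
id1 = [2, 2, 3]
id2 = [1, 1, 3]

# distinct-then-union: duplicates ACROSS the two inputs survive
distinct_then_union = list(dict.fromkeys(id1)) + list(dict.fromkeys(id2))
print(distinct_then_union)   # [2, 3, 1, 3] -- 4 elements

# union-then-distinct: all duplicates removed
union_then_distinct = sorted(set(id1 + id2))
print(union_then_distinct)   # [1, 2, 3] -- 3 elements
```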
You see that doing the distinct afterwards removes the duplicates that exist between your dataframes.