I am using Pyspark, and I have 4 data frames, each having the same schema. I want to count the distinct ids in them all.
But if I do:
combined_df1 = df1.select('id') \
    .union(df2.select('id')) \
    .union(df3.select('id')) \
    .union(df4.select('id'))
combined_df1.select('id').distinct().count()
I get a higher figure than if I do:
combined_df2 = df1 \
    .union(df2) \
    .union(df3) \
    .union(df4)
combined_df2.select('id').distinct().count()
About 10% more rows are returned, which is not an insignificant amount. What's going on here? Why does it matter whether I select the distinct ids before or after the union? It's the same list of ids either way, isn't it?
CodePudding user response:
It is difficult to be sure, but my guess is that the columns in your dataframes are not all in the same order. union combines dataframes positionally, so the id column of the first dataframe can end up "unioned" with a different column of the next dataframes. The distinct then "fails" because it deduplicates mismatched values. Note that your first snippet is immune to this, because it selects id before the union, so only id columns are ever combined; your second snippet unions entire rows positionally, which is why the two counts can differ.
The solution is to replace union with unionByName, which matches columns by name instead of by position.
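To see the positional pitfall without spinning up Spark, here is a minimal pure-Python sketch (the column names and rows are illustrative, not from your data); lists of dicts stand in for dataframes, and tuples built by position stand in for union:

```python
# Two "dataframes" with the same schema but columns in a different order.
rows_a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
rows_b = [{"name": "y", "id": 2}, {"name": "z", "id": 3}]

cols_a = ["id", "name"]
cols_b = ["name", "id"]  # same columns, different order

# union-style: columns matched by POSITION (like DataFrame.union)
positional = [tuple(r[c] for c in cols_a) for r in rows_a] + \
             [tuple(r[c] for c in cols_b) for r in rows_b]
# the "id" column now mixes real ids with names from rows_b
ids_positional = {t[0] for t in positional}

# unionByName-style: columns matched by NAME
by_name = [tuple(r[c] for c in cols_a) for r in rows_a + rows_b]
ids_by_name = {t[0] for t in by_name}

print(ids_positional)  # {1, 2, 'y', 'z'} -- four "distinct ids", two bogus
print(ids_by_name)     # {1, 2, 3} -- the true distinct ids
```

The positional version inflates the distinct count with values that were never ids, which is exactly the kind of discrepancy you are seeing.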
To answer your exact question:
Why does it matter whether I select distinct id before or after the union?
TL;DR: union does not deduplicate values.
Let's take a simple example with 2 dataframes:
df1:
id |
---|
2 |
2 |
3 |
df2:
id |
---|
1 |
1 |
3 |
df1.distinct() ==> [2,3]
df2.distinct() ==> [1,3]
df1.distinct().union(df2.distinct()) ==> [2,3].union([1,3]) ==> [2,3,1,3] (4 elements)
df1.union(df2) ==> [2,2,3,1,1,3]
df1.union(df2).distinct() ==> [1,2,3] (only 3 elements)
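The arithmetic above can be checked with plain Python, using sets to stand in for distinct:

```python
id1 = [2, 2, 3]
id2 = [1, 1, 3]

# distinct-then-union: duplicates ACROSS the two inputs survive
distinct_then_union = list(dict.fromkeys(id1)) + list(dict.fromkeys(id2))
print(distinct_then_union)   # [2, 3, 1, 3] -- 4 elements

# union-then-distinct: all duplicates removed
union_then_distinct = sorted(set(id1 + id2))
print(union_then_distinct)   # [1, 2, 3] -- 3 elements
```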
You see that doing the distinct afterwards removes the duplicates that exist between your dataframes.