How to check Spark DataFrame difference?-CodePudding

I need to check my solution for idempotency and check how much it's different with past solution.

I tried next:

spark.sql('''
    select * from t1
        except
    select * from t2
''').count()

It's gives me information how much this tables different (t1 - my solution, t2 - primal data). If here is many different data, I want to check, where it different.

So, I tried that:

diff = {}
columns = t1.columns

for col in columns:
    cntr = spark.sql('''
        select {col} from t1
            except
        select {col} from t2
    ''').count()

    diff[col] = cntr

print(diff)

It's not good for me, because it's works about 1-2 hours (both tables have 30 columns and 30 million lines of data).

Do you guys have an idea how to calculate this quickly?

CodePudding user response：

Except is a kind of a join on all columns at the same time. Does your data have a primary key? It could even be complex, comprising of multiple columns, but it's still much better then taking all 30 columns into account.

Once you figure out the primary key you can do the FULL OUTER JOIN and:

check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)

CodePudding user response：

Given that your resource remains unchanged, I think there are three ways that you can optimize:

Join two dataframe once but not looping the except: I assume your dataset should have a key / index, otherwise there is no ordering in your both dataframe and you can't perform except to check the difference. Unless you have limited resource, just do join once to concat two dataframe first instead of multiple except.
Check your data partitioning: Even you use point 1 / the method that you're using, make sure that data partition is in even distribution with optimal number of partition. Most of the time, data skew is one of the critical parts to lower your performance. If your key is a string, use repartition. If you're using a sequence number, use repartitionByRange.
Use the when-otherwise pair to check the difference: once you join two dataframe, you can use a when-otherwise condition to compare the difference, for example: df.select(func.sum(func.when(func.col('df1.col_a')!=func.col('df2.col_a'), func.lit(1))).otherwise(func.lit(0)).alias('diff_in_col_a_count')). Therefore, you can calculate all the difference within one action but not multiple action.