count how many are in L_df but not in A_df in spark


I can count how many values are in L_df but not in A_df (comparing their 'id' columns) with NumPy:

import numpy as np
missing_data = np.isin(L_df['id'], A_df['id'], invert=True).sum()
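
For reference, a minimal runnable version of the above (the sample ids are hypothetical, and L_df and A_df are assumed to be pandas DataFrames):

import numpy as np
import pandas as pd

# Hypothetical sample data: ids 1 and 2 appear only in L_df
L_df = pd.DataFrame({'id': [1, 2, 3, 4, 5]})
A_df = pd.DataFrame({'id': [3, 4, 5, 6, 7]})

# invert=True flags the ids of L_df that do NOT appear in A_df
missing_data = np.isin(L_df['id'], A_df['id'], invert=True).sum()
print(missing_data)  # 2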

What is the equivalent code in PySpark to count the number of missing values?

CodePudding user response:

You can use an anti join. Quoting the Spark documentation:

Anti Join: An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.

Assuming you load L_df and A_df as Spark DataFrames, you can use DataFrame.join with an anti join as follows:

# Keeps only the rows of L_df whose 'id' has no match in A_df
L_df.join(A_df, on='id', how='anti').count()
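
A self-contained sketch, assuming a local SparkSession and the same hypothetical ids as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

L_df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ['id'])
A_df = spark.createDataFrame([(3,), (4,), (5,), (6,), (7,)], ['id'])

# Anti join keeps only the rows of L_df whose 'id' never matches A_df
missing_data = L_df.join(A_df, on='id', how='anti').count()
print(missing_data)  # 2

Note that PySpark accepts 'anti' as an alias for 'left_anti', so how='left_anti' gives the same result.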