I can count how many ids are in L_df but not in A_df (in their 'id' column) using numpy:

import numpy as np

missing_data = np.isin(L_df['id'], A_df['id'], invert=True).sum()
What is the equivalent code in PySpark to count the number of missing ids?
CodePudding user response:
You can use an anti join. Quoting the documentation from here:

Anti Join: An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.
Assuming you load the dataframes L_df and A_df as Spark DataFrames, you can use DataFrame.join with an anti join as follows:
missing_data = L_df.join(A_df, on='id', how='anti').count()
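For reference, here is a minimal end-to-end sketch; the SparkSession setup and the sample id values are illustrative assumptions, not part of the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: ids 3 and 4 appear in L_df but not in A_df
L_df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ['id'])
A_df = spark.createDataFrame([(1,), (2,)], ['id'])

# Keep only rows of L_df whose 'id' has no match in A_df, then count them
missing_data = L_df.join(A_df, on='id', how='anti').count()
print(missing_data)  # 2

Note that how='anti' is an alias for 'left_anti'; both keep only the rows of the left DataFrame with no matching id on the right, which is exactly what np.isin(..., invert=True) counts on the pandas side.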