I am looking to compare a large amount of data (GBs) spread across 2 CSV files. The CSV files have no headers and contain only a single column, which holds a complex string mixture of numbers and letters, like this:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
+------------------------------+
So far, I am able to convert them into DataFrames, but I am not sure how to proceed. Is there any way to get the row numbers and rows of df1 not found in df2?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
file1 = 'file_path'
file2 = 'file_path'
df1 = spark.read.csv(file1)
df2 = spark.read.csv(file2)
df1.show(truncate=False)
CodePudding user response:
Let's go step by step, since you are still learning. Say df1 looks like this:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
|Hello | world | 1.3123.412 | C|
+------------------------------+
and df2:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | D|
|Hello | world | 1.3123.412 | C|
+------------------------------+
Generate row numbers using a window function (this needs row_number and Window imported):
from pyspark.sql import Window
from pyspark.sql.functions import row_number

df1 = df1.withColumn('id', row_number().over(Window.orderBy('_c0')))
df2 = df2.withColumn('id', row_number().over(Window.orderBy('_c0')))
Use a left semi join. These joins do not keep any columns from the right DataFrame. They only compare values and keep the rows of the left DataFrame whose join key is also found in the right DataFrame:
df1.join(df2, how='left_semi', on='_c0').show(truncate=False)
+------------------------------+---+
|_c0                           |id |
+------------------------------+---+
|Hello | world | 1.3123.412 | C|2  |
+------------------------------+---+