I am looking to compare a large amount of data (GBs) spread across 2 CSV files. The CSV files have no headers and contain only a single column, which holds a complex string mixture of numbers and letters, like this:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
+------------------------------+
So far, I am able to convert them into DataFrames, but I am not sure how to proceed. Is there any way to get the row numbers and rows of df1 not found in df2?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
file1 = 'file_path'
file2 = 'file_path'
df1 = spark.read.csv(file1)
df2 = spark.read.csv(file2)
df1.show(truncate=False)
CodePudding user response:
Let's go step by step, since you are still learning. Say df1 looks like this:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
|Hello | world | 1.3123.412 | C|
+------------------------------+
and df2:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | D|
|Hello | world | 1.3123.412 | C|
+------------------------------+
Generate row numbers using a window function (this needs row_number and Window imported):
from pyspark.sql import Window
from pyspark.sql.functions import row_number

df1 = df1.withColumn('id', row_number().over(Window.orderBy('_c0')))
df2 = df2.withColumn('id', row_number().over(Window.orderBy('_c0')))
Use a left semi join. These joins do not keep any columns from the right DataFrame. They only compare values and keep the rows of the left DataFrame whose join key is also found in the right DataFrame:
df1.join(df2, how='left_semi', on='_c0').show(truncate=False)
+------------------------------+---+
|_c0                           |id |
+------------------------------+---+
|Hello | world | 1.3123.412 | C|2  |
+------------------------------+---+