join or merge two csv files using spark scala which has unique columns in both files
Input dataframes
csv1 emp_name designation salary_col
csv2 emp_name designation advance_salary_col
Output dataframe
final_csv emp_name designation salary_col advance_salary_col
csv1
emp_name designation salary_col
smith manager 40000
john analyst 35000
adam sr.engineer 50000
eve QA 36000
mills sr.manager 44000
csv2
emp_name designation advance_salary_col
smith manager 2000
john analyst 3030
adam sr.engineer 5044
eve QA 3600
mills sr.manager 4500
final_csv
emp_name designation salary_col advance_salary_col
smith manager 40000 2000
john analyst 35000 3030
adam sr.engineer 50000 5044
eve QA 36000 3600
mills sr.manager 44000 4500
I tried using few methods Union,Intersect,UnionByName but am getting null values for any of the column using scala
val emp_dataDf1 = spark.read.format("csv")
.option("header", "true")
.load("data/emp_data1.csv")
val emp_dataDf2 = spark.read.format("csv")
.option("header", "true")
.load("/data/emp_data2.csv")
val final_df= emp_dataDf1.union(emp_dataDf2)
CodePudding user response:
This is a join. See docs about SQL joins and joins in Spark.
val final_df = emp_dataDf1.join(emp_dataDf2, Seq("empname", "designation"))
CodePudding user response:
NOTE: you need to mention the separator for csv files and if you are using spaces (or any separator for that matter) make sure you use the exact string literal.
here is how you can do it:
val csv1 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv1_file")
val csv2 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv2_file")
val joinExpression = Seq("emp_name", "designation")
csv1.join(csv2, joinExpression, "inner").show(false)
/* output *
*
-------- ----------- ---------- ------------------
|emp_name|designation|salary_col|advance_salary_col|
-------- ----------- ---------- ------------------
|smith |manager |40000 |2000 |
|john |analyst |35000 |3030 |
|adam |sr.engineer|50000 |5044 |
|eve |QA |36000 |3600 |
|mills |sr.manager |44000 |4500 |
-------- ----------- ---------- ------------------
*/