I want to join csv1 with csv2 to result into final_csv, the schema only has String type columns (file contents are as follows):
csv1
emp_name designation salary_col
smith manager 40000
john analyst 35000
adam sr.engineer 50000
eve QA 36000
mills sr.manager 44000
csv2
emp_name designation advance_salary_col
smith manager 2000
john analyst 3030
adam sr.engineer 5044
eve QA 3600
mills sr.manager 4500
final_csv
emp_name designation salary_col advance_salary_col
smith manager 40000 2000
john analyst 35000 3030
adam sr.engineer 50000 5044
eve QA 36000 3600
mills sr.manager 44000 4500
I tried using few methods Union, Intersect, UnionByName but am getting null values for all of the columns in scala in my final_df
val emp_dataDf1 = spark.read.format("csv")
.option("header", "true")
.load("data/emp_data1.csv")
val emp_dataDf2 = spark.read.format("csv")
.option("header", "true")
.load("/data/emp_data2.csv")
val final_df= emp_dataDf1.union(emp_dataDf2)
CodePudding user response:
This is a join. See docs about SQL joins and joins in Spark.
val final_df = emp_dataDf1.join(emp_dataDf2, Seq("empname", "designation"))
CodePudding user response:
NOTE: you need to mention the separator for csv files and if you are using spaces (or any separator for that matter) make sure you use the exact string literal.
here is how you can do it:
val csv1 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv1_file")
val csv2 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv2_file")
val joinExpression = Seq("emp_name", "designation")
csv1.join(csv2, joinExpression, "inner").show(false)
/* output *
*
-------- ----------- ---------- ------------------
|emp_name|designation|salary_col|advance_salary_col|
-------- ----------- ---------- ------------------
|smith |manager |40000 |2000 |
|john |analyst |35000 |3030 |
|adam |sr.engineer|50000 |5044 |
|eve |QA |36000 |3600 |
|mills |sr.manager |44000 |4500 |
-------- ----------- ---------- ------------------
*/