Home > OS >  join or merge two csv files which has two unique columns in spark scala
join or merge two csv files which has two unique columns in spark scala

Time:02-14

join or merge two csv files using spark scala which has unique columns in both files

 Input dataframes
 csv1   emp_name designation salary_col 
 csv2   emp_name  designation advance_salary_col
 
 Output dataframe
 final_csv  emp_name   designation  salary_col advance_salary_col

  csv1 
  emp_name   designation   salary_col
   smith        manager      40000
   john        analyst       35000
   adam       sr.engineer    50000
   eve           QA          36000
   mills      sr.manager     44000 

   csv2 
  emp_name   designation   advance_salary_col
   smith        manager      2000
   john        analyst       3030
   adam       sr.engineer    5044
   eve           QA          3600
   mills      sr.manager     4500 
 
  final_csv 
  emp_name   designation   salary_col   advance_salary_col
   smith        manager      40000           2000    
   john        analyst       35000           3030
   adam       sr.engineer    50000           5044
   eve           QA          36000           3600
   mills      sr.manager     44000           4500

I tried using few methods Union,Intersect,UnionByName but am getting null values for any of the column using scala

      val emp_dataDf1 = spark.read.format("csv")
                         .option("header", "true")
                         .load("data/emp_data1.csv")

       val emp_dataDf2 = spark.read.format("csv")
                         .option("header", "true")
                         .load("/data/emp_data2.csv") 
        val final_df= emp_dataDf1.union(emp_dataDf2)  

CodePudding user response:

This is a join. See docs about SQL joins and joins in Spark.

val final_df = emp_dataDf1.join(emp_dataDf2, Seq("empname", "designation"))

CodePudding user response:

NOTE: you need to mention the separator for csv files and if you are using spaces (or any separator for that matter) make sure you use the exact string literal.

here is how you can do it:

val csv1 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv1_file")

val csv2 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv2_file")

val joinExpression = Seq("emp_name", "designation")
csv1.join(csv2, joinExpression, "inner").show(false)


/* output *
*
 -------- ----------- ---------- ------------------ 
|emp_name|designation|salary_col|advance_salary_col|
 -------- ----------- ---------- ------------------ 
|smith   |manager    |40000     |2000              |
|john    |analyst    |35000     |3030              |
|adam    |sr.engineer|50000     |5044              |
|eve     |QA         |36000     |3600              |
|mills   |sr.manager |44000     |4500              |
 -------- ----------- ---------- ------------------ 

*/
  • Related