Home > Blockchain >  combine two different csv files and make them into one
combine two different csv files and make them into one

Time:02-17

I want to join csv1 with csv2 to result into final_csv, the schema only has String type columns (file contents are as follows):

  csv1 
  emp_name   designation   salary_col
   smith        manager      40000
   john        analyst       35000
   adam       sr.engineer    50000
   eve           QA          36000
   mills      sr.manager     44000 

   csv2 
  emp_name   designation   advance_salary_col
   smith        manager      2000
   john        analyst       3030
   adam       sr.engineer    5044
   eve           QA          3600
   mills      sr.manager     4500 

 final_csv 
  emp_name   designation   salary_col   advance_salary_col
   smith        manager      40000           2000    
   john        analyst       35000           3030
   adam       sr.engineer    50000           5044
   eve           QA          36000           3600
   mills      sr.manager     44000           4500

I tried using few methods Union, Intersect, UnionByName but am getting null values for all of the columns in scala in my final_df

    val emp_dataDf1 = spark.read.format("csv")
                                 .option("header", "true")
                                 .load("data/emp_data1.csv")

    val emp_dataDf2 = spark.read.format("csv")
                                 .option("header", "true")
                                 .load("/data/emp_data2.csv") 

    val final_df= emp_dataDf1.union(emp_dataDf2)  

CodePudding user response:

This is a join. See docs about SQL joins and joins in Spark.

val final_df = emp_dataDf1.join(emp_dataDf2, Seq("empname", "designation"))

CodePudding user response:

NOTE: you need to mention the separator for csv files and if you are using spaces (or any separator for that matter) make sure you use the exact string literal.

here is how you can do it:

val csv1 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv1_file")

val csv2 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv2_file")

val joinExpression = Seq("emp_name", "designation")
csv1.join(csv2, joinExpression, "inner").show(false)


/* output *
*
 -------- ----------- ---------- ------------------ 
|emp_name|designation|salary_col|advance_salary_col|
 -------- ----------- ---------- ------------------ 
|smith   |manager    |40000     |2000              |
|john    |analyst    |35000     |3030              |
|adam    |sr.engineer|50000     |5044              |
|eve     |QA         |36000     |3600              |
|mills   |sr.manager |44000     |4500              |
 -------- ----------- ---------- ------------------ 

*/
  • Related