Creating a new column using info from another df

Time:10-21

I'm trying to create a new column based off information from another data table.

df1

Loc Time   Wage
1    192    1
3    192    2
1    193    3
5    193    3
7    193    5
2    194    7

df2

Loc  City
1    NYC
2    Miami
3    LA
4    Chicago
5    Houston
6    SF
7    DC

desired output:

Loc Time   Wage  City
1    192    1    NYC
3    192    2    LA
1    193    3    NYC
5    193    3    Houston
7    193    5    DC
2    194    7    Miami

The actual dataframes differ considerably in row count, but the structure is along those lines. I think this might be achievable through .map, but I haven't found much documentation for that online. join doesn't really seem to fit this situation.

CodePudding user response:

join is exactly what you need. Try running this in the spark-shell:

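import org.apache.spark.sql.SparkSession

// spark-shell already provides `spark`; getOrCreate() returns that existing session
val spark = SparkSession.builder().appName("my_app").getOrCreate()
import spark.implicits._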

val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 192, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))

val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)

val outputDf = df1.join(df2, Seq("loc"))  // inner join on the shared column "loc"

outputDf.show()

This will output something like the following (row order is not guaranteed):

+---+----+----+-------+
|loc|time|wage|   city|
+---+----+----+-------+
|  1| 192|   1|    NYC|
|  1| 193|   3|    NYC|
|  2| 194|   7|  Miami|
|  3| 192|   2|     LA|
|  5| 193|   3|Houston|
|  7| 193|   5|     DC|
+---+----+----+-------+
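
One thing to watch: the join above is an inner join, so any row of df1 whose loc has no match in df2 would be dropped. Below is a minimal sketch of a variant, not a definitive implementation. It assumes the same sample data as the question, and it assumes df2 is small enough to broadcast (the question does not say so). A left join keeps every row of df1 and leaves city as null where there is no match, and an explicit orderBy makes the displayed order deterministic. The name withCity is only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// In spark-shell, getOrCreate() returns the session that already exists as `spark`.
val spark = SparkSession.builder().appName("my_app").getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val df1 = Seq((1, 192, 1), (3, 192, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
  .toDF("loc", "time", "wage")
val df2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
  .toDF("loc", "city")

// Left join: every row of df1 is kept; city is null where df2 has no matching loc.
// broadcast(...) is only a hint (assumption: df2 is small) asking Spark to ship
// the lookup table to each executor instead of shuffling both sides.
val withCity = df1.join(broadcast(df2), Seq("loc"), "left")

// A join does not guarantee row order, so sort explicitly before displaying.
withCity.orderBy("time", "loc").show()
```

With the sample data, sorting by time and then loc gives the same row order as the desired output in the question. If the lookup table is large, dropping the broadcast hint and letting Spark choose the join strategy is probably the safer default.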