I'm trying to create a new column based off information from another data table.
df1
Loc Time Wage
1 192 1
3 192 2
1 193 3
5 193 3
7 193 5
2 194 7
df2
Loc City
1 NYC
2 Miami
3 LA
4 Chicago
5 Houston
6 SF
7 DC
desired output:
Loc Time Wage City
1 192 1 NYC
3 192 2 LA
1 193 3 NYC
5 193 3 Houston
7 193 5 DC
2 194 7 Miami
The actual dataframes vary quite largely in terms of row numbers, but its something along the lines of that. I think this might be achievable through .map
but I haven't found much documentation for that online. join
doesn't really seem to fit this situation.
CodePudding user response:
join
is exactly what you need. Try running this in the spark-shell
val sparkSession = SparkSession.builder().appName("my_app").getOrCreate()
import spark.implicits._
val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 193, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
val outputDf = df1.join(df2, Seq("loc")) // join on the column "loc"
outputDf.show()
This will output
--- ---- ---- -------
|loc|time|wage| city|
--- ---- ---- -------
| 1| 192| 1| NYC|
| 1| 193| 3| NYC|
| 2| 194| 7| Miami|
| 3| 193| 2| LA|
| 5| 193| 3|Houston|
| 7| 193| 5| DC|
--- ---- ---- -------