Here's my issue: I have a first dataframe that is basically a list of cities and the countries they are in. I have a second dataframe with a list of users and the cities they live in. I'd like to add a "country" column to the second dataframe, whose value is based on the "city" column, but the city names can be typed differently (for example, Washington and washington should both give me USA).
I thought the best way to do that would be to create a foo(country: String) : String
which would return the country by parsing the first dataframe, but I can't find a way to use this function while creating my new column.
CodePudding user response:
First, lowercase the city column of both dataframes, since you are going to join on the city key. After the join, capitalize the first letter of the city again with initcap. This code should do what you are looking for:
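import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, initcap, lower}

object Main {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
    import sparkSession.implicits._

    // Lookup table: lowercase the city so the join key is case-insensitive
    val citiesDF = Seq(
      ("London", "England"), ("Washington", "USA")
    )
      .toDF("city", "country")
      .withColumn("city", lower(col("city")))

    // Users: apply the same lowercasing to the join key
    val usersDF = Seq(
      ("Andy", "London"), ("Mark", "Washington"), ("Bob", "washington")
    )
      .toDF("name", "city")
      .withColumn("city", lower(col("city")))

    // Inner join on the lowercased city, then capitalize the city for display
    val resultDF = citiesDF.join(usersDF, Seq("city"))
      .withColumn("city", initcap(col("city")))

    resultDF.show()
  }
}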
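One thing to be aware of: the join above is an inner join, so a user whose city is not in the cities dataframe disappears from the result, and the original spelling of the city is replaced by the initcap version. If that matters, here is a rough sketch of a variant that joins on a separate normalized key and uses a left join. The names CountryLookup, addCountry and city_key, and the extra user ("Eve", "Paris"), are made up for this illustration and are not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, trim}

object CountryLookup {
  // Hypothetical helper: adds a "country" column to users by matching
  // cities case-insensitively (and ignoring surrounding whitespace).
  def addCountry(users: DataFrame, cities: DataFrame): DataFrame = {
    // Keep only the normalized key and the country on the lookup side
    val lookup = cities
      .withColumn("city_key", lower(trim(col("city"))))
      .select("city_key", "country")

    users
      .withColumn("city_key", lower(trim(col("city"))))
      // Left join: users with an unknown city are kept, with a null country
      .join(lookup, Seq("city_key"), "left")
      .drop("city_key")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("CountryLookup")
      .getOrCreate()
    import spark.implicits._

    val citiesDF = Seq(("London", "England"), ("Washington", "USA"))
      .toDF("city", "country")
    // "Eve" / "Paris" is invented sample data with no match in citiesDF
    val usersDF = Seq(
      ("Andy", "London"), ("Mark", "Washington"),
      ("Bob", "washington"), ("Eve", "Paris")
    ).toDF("name", "city")

    addCountry(usersDF, citiesDF).show()
    spark.stop()
  }
}
```

With this version the users dataframe keeps its own city column exactly as typed, Bob still gets USA, and Eve stays in the result with a null country instead of being dropped.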