I am working with nested JSON to build an ETL. I am mainly using explode and withColumn to flatten the nested JSON. One of the keys in the JSON holds a state name, and I want to convert that value into the abbreviated state name. For example: Texas should become TX.
.withColumn("NewColName", $"col.someKey.keyStoringValue")
This gets me the associated value from the required key in the JSON. So if ...someKey.keyStoringValue's value is Texas, my code returns Texas. But I want it to be TX. How can I do that? I have tried to create a function like this:
def getStateName(state: String): String = {
  val maps = Map("Texas" -> "TX", "Virginia" -> "VA")
  maps.apply(state)
}
.....
.....
.withColumn("NewColName", getStateName($"col.someKey.keyStoringValue"))
.....
.....
I am getting this error:
error: type mismatch;
found: org.apache.spark.sql.ColumnName
required: String
CodePudding user response:
You can't apply a plain Scala function directly in the DataFrame/Spark SQL API. For this kind of mapping you need to create a UDF:
import org.apache.spark.sql.functions.udf
val maps = Map("Texas" -> "TX", "Virginia" -> "VA")
// getOrElse falls back to the input for unmapped states instead of throwing
val fun = udf { (in: String) => maps.getOrElse(in, in) }
...
.withColumn("NewColName", fun($"col.someKey.keyStoringValue"))
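For reference, the lookup inside the UDF is plain Scala and can be unit-tested without a SparkSession. A minimal sketch (getStateName mirrors the asker's helper; the fallback to the original value for unmapped states is an assumption about desired behavior):

```scala
// Plain-Scala lookup that the UDF wraps; testable without Spark.
val stateAbbrev = Map("Texas" -> "TX", "Virginia" -> "VA")

// Total function: unmapped states fall back to the original value
// instead of throwing a NoSuchElementException.
def getStateName(state: String): String =
  stateAbbrev.getOrElse(state, state)
```

Wrapping this as udf(getStateName _) keeps the lookup logic separately testable from the Spark pipeline.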
CodePudding user response:
The method you are looking for is when (see the documentation), which allows you to return a value based on a condition. You'd do something like:
df.withColumn(
  "NewColName",
  when($"col.someKey.keyStoringValue" === "Texas", "TX")
    .when($"col.someKey.keyStoringValue" === "Virginia", "VA")
    .otherwise("Unknown")
)
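For comparison, the per-row logic of this when/otherwise chain corresponds to a plain Scala pattern match (a sketch, not Spark code; abbreviate is an illustrative name):

```scala
// Pattern-match equivalent of the when/otherwise chain above:
// matched states map to their abbreviation, everything else to "Unknown".
def abbreviate(state: String): String = state match {
  case "Texas"    => "TX"
  case "Virginia" => "VA"
  case _          => "Unknown"
}
```

With many states, hand-writing the chain gets verbose, which is one reason the UDF-over-a-Map approach from the first answer tends to scale better.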