Capitalize the first letter of each word | Spark Scala

I have a table as below:

ID  City               Country
1   Frankfurt am main  Germany

The DataFrame needs to be displayed with the first letter of each word in the City column capitalized, i.e. the output should look like this:

ID  City               Country
1   Frankfurt Am Main  Germany

The solution I worked with is below:

df.map(x => x.getString(1).trim().split(' ').map(_.capitalize).mkString(" ")).show()

This only returns the City column, aliased as "value".

How can I get all the columns back with the above transformation applied?

CodePudding user response:

You can use the initcap function (see the Spark API docs):

public static Column initcap(Column e)

Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.

For example, "hello world" will become "Hello World".

Parameters: e - (undocumented)
Returns: (undocumented)
Since: 1.5.0

Sample code

import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(("1", "Frankfurt am main", "Germany"))
val df = data.toDF("Id", "City", "Country")
df.withColumn("City", initcap(col("City"))).show

And the output is:

+---+-----------------+-------+
| Id|             City|Country|
+---+-----------------+-------+
|  1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
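The same transformation can also be written in Spark SQL, since initcap is a built-in SQL function as well. A minimal sketch, assuming the df from above ("cities" is just an illustrative view name):

// Register the DataFrame as a temporary view and use initcap in SQL
df.createOrReplaceTempView("cities")
spark.sql("SELECT Id, initcap(City) AS City, Country FROM cities").show()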

Your sample code was returning only one column because that's exactly what you coded in your map: take each row x, get the string at index 1 (the City column), transform it, and return only that value.

You could do what you wanted with map, as you can see in the other answers, but the output of your map needs to include all the columns.

Why am I not using map in my answer? The general rule is: when there is a built-in SQL function, use it instead of a custom map/UDF. Most of the time the SQL function will be better in terms of performance, because it is easier for Catalyst to optimize.
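If you want to see the difference yourself, compare the physical plans: the built-in initcap stays a regular Catalyst expression, while a UDF shows up as an opaque call the optimizer cannot look into (capUdf below is just an illustrative name):

// Built-in function: Catalyst can see into and optimize the expression
df.withColumn("City", initcap(col("City"))).explain()

// UDF: appears as an opaque ScalaUDF node in the plan
val capUdf = udf((s: String) => s.split(' ').map(_.capitalize).mkString(" "))
df.withColumn("City", capUdf(col("City"))).explain()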

CodePudding user response:

You can call a UDF in a loop over all the columns:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val data = Seq(
  (1, "Frankfurt am main", "just test", "Germany"),
  (2, "should do this also", "test", "France")
)
val df = spark.sparkContext.parallelize(data).toDF("ID", "City", "test", "Country")

// UDF that capitalizes the first letter of each whitespace-delimited word
val convertUDF = udf((value: String) => value.split(' ').map(_.capitalize).mkString(" "))

// Apply the UDF to every column in turn
val dfCapitalized = df.columns.foldLeft(df) {
  (acc, column) => acc.withColumn(column, convertUDF(col(column)))
}
dfCapitalized.show(false)

+---+-------------------+---------+-------+
|ID |City               |test     |Country|
+---+-------------------+---------+-------+
|1  |Frankfurt Am Main  |Just Test|Germany|
|2  |Should Do This Also|Test     |France |
+---+-------------------+---------+-------+
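Note that the fold above also runs the UDF over the integer ID column. If you'd rather leave non-string columns untouched, one option is to fold only over the string columns. A sketch, assuming the same df and convertUDF as above:

import org.apache.spark.sql.types.StringType

// Collect the names of the StringType columns only; ID keeps its integer type
val stringCols = df.schema.fields.collect {
  case f if f.dataType == StringType => f.name
}
val dfStringsCapitalized = stringCols.foldLeft(df) {
  (acc, column) => acc.withColumn(column, convertUDF(col(column)))
}
dfStringsCapitalized.show(false)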

CodePudding user response:

You could map over your DataFrame and then simply use plain Scala functions to capitalize. This gives you quite a bit of flexibility in exactly which transformations you apply, since you have the whole Scala language at your disposal.

Something like this:

import spark.implicits._
val df = Seq(
  (1, "Frankfurt am main", "Germany")
).toDF("ID", "City", "Country")

val output = df.map{
  row => (
    row.getInt(0),                                                // ID
    row.getString(1).split(' ').map(_.capitalize).mkString(" "),  // City, capitalized
    row.getString(2)                                              // Country
  )
}
output.show
+---+-----------------+-------+
| _1|               _2|     _3|
+---+-----------------+-------+
|  1|Frankfurt Am Main|Germany|
+---+-----------------+-------+

Inside the map function, we output a tuple with the same number of elements as the number of columns you want to end up with.
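Since a tuple carries no field names, the resulting columns come out as _1, _2 and _3, as in the output above. You can restore meaningful names with toDF, for example:

// Rename the tuple columns back to the original names
output.toDF("ID", "City", "Country").show()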

Hope this helps!
