In Spark Scala, how to check how many characters in a string in a dataframe column are uppercase?

Time:05-05

I am using Spark Scala on Databricks. I have a dataframe with a column, and all values in that column are strings. I want to create another column in which I will have the number of uppercase letters in the original column. An example would be like this:

val df = Seq(("Java",1), ("2000",0), ("python",0), ("ScAlA",3), ("AuguST/Car",4)).toDF("list", "qty_uppercase")

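Which gives:

+----------+-------------+
|      list|qty_uppercase|
+----------+-------------+
|      Java|            1|
|      2000|            0|
|    python|            0|
|     ScAlA|            3|
|AuguST/Car|            4|
+----------+-------------+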

I don't know how to do this.

I have tried splitting the strings in the column "list" using the following command:

.withColumn("list_split", split($"list",""))

This gives me a new column holding an array of single-character strings, but then I can't find the right way to iterate through each character of that column.

For example, I tried the approach mentioned in another question and attempted to create a column using exists, but it doesn't work:

.withColumn("count", $"list".exists(_.isUpper))

error: value exists is not a member of org.apache.spark.sql.ColumnName
.withColumn("qty", $"list".exists(_.isUpper))

Answer:

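One way is to split the string on uppercase letters and count the resulting pieces. Each uppercase letter adds one more piece, so the number of uppercase letters is the number of pieces minus one:

import org.apache.spark.sql.functions.{size, split}
val df = Seq(("Java",1), ("2000",0), ("python",0), ("ScAlA",3), ("AuguST/Car",4)).toDF("list", "qty_uppercase")
// n uppercase letters split the string into n + 1 pieces, hence the "- 1"
df.withColumn("count", size(split($"list", "[A-Z]")) - 1).show()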

which matches the expected counts:

+----------+-------------+-----+
|      list|qty_uppercase|count|
+----------+-------------+-----+
|      Java|            1|    1|
|      2000|            0|    0|
|    python|            0|    0|
|     ScAlA|            3|    3|
|AuguST/Car|            4|    4|
+----------+-------------+-----+
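
To see why the split-and-subtract trick works, here is a minimal sketch in plain Scala (no Spark) that applies the same logic to the sample strings. The object and method names (UppercaseCount, countBySplit, countDirect) are made up for illustration. It assumes Spark's split keeps trailing empty strings the way Java's String.split does with a limit of -1, which is what the counts above suggest.

```scala
// Plain Scala, no Spark: mirrors the split-and-count logic on the sample strings.
// All names here are hypothetical and exist only for this illustration.
object UppercaseCount {
  // Split on every ASCII uppercase letter. The limit of -1 keeps trailing
  // empty strings, so n uppercase letters always yield n + 1 pieces.
  def countBySplit(s: String): Int = s.split("[A-Z]", -1).length - 1

  // Direct count of ASCII uppercase letters, used as a cross-check.
  def countDirect(s: String): Int = s.count(c => c >= 'A' && c <= 'Z')

  val samples: Seq[String] = Seq("Java", "2000", "python", "ScAlA", "AuguST/Car")

  // Counts for the sample strings, in the same order as the table above.
  val counts: Seq[Int] = samples.map(countBySplit)
}
```

Note that [A-Z] only matches ASCII uppercase letters. If accented or other non-ASCII uppercase letters should be counted too, the pattern \p{Lu} may be a better fit. An alternative that should give the same result in Spark is to delete everything that is not uppercase and take the length of what remains, for example length(regexp_replace($"list", "[^A-Z]", "")).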