Prefix column with zeroes in Spark

I am creating a new attribute x by concatenating attributes a, b, and c. If the total length of x is less than 10, then I want to left-pad only attribute c with zeroes so that the overall length becomes 10. How can I do this?

val x = when(length(concat($"a", $"b", $"c")) < 10,
        concat($"a", $"b", lpad($"c", 10, "0")))
       .otherwise(concat($"a", $"b", $"c"))

The above won't work: column c gets left-padded with zeroes until it alone is 10 characters long, whereas I want the overall length after the concat to be 10. For example, with a = "a", b = "b", c = "c" I expect ab0000000c (length 10), but the above produces ab000000000c (length 12). Please suggest.

CodePudding user response:

You can use an SQL expression for the lpad, since the SQL version of lpad accepts an expression (not just a constant) for the length argument:

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $ column syntax

val df = Seq(("aaa", "bbb", "cccc"),
  ("a", "b", "c"),
  ("a", "b", "1234567890123"),
  ("a", "b", "")).toDF("a", "b", "c")

df.withColumn("x", when(length(concat($"a",$"b",$"c")) < 10,
  concat($"a", $"b", expr("lpad(c, 10 - char_length(a) - char_length(b), '0')")))
  .otherwise(concat($"a",$"b",$"c")))
  .show()

Output:

+---+---+-------------+---------------+
|  a|  b|            c|              x|
+---+---+-------------+---------------+
|aaa|bbb|         cccc|     aaabbbcccc|
|  a|  b|            c|     ab0000000c|
|  a|  b|1234567890123|ab1234567890123|
|  a|  b|             |     ab00000000|
+---+---+-------------+---------------+
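
For reference, the same logic can also be written as one self-contained SQL expression via selectExpr, since SQL's lpad (unlike the Scala wrapper) evaluates its length argument per row. A sketch against the same df; CASE, char_length, concat and lpad are all standard Spark SQL functions:

df.selectExpr(
  "a", "b", "c",
  """CASE WHEN char_length(concat(a, b, c)) < 10
          THEN concat(a, b, lpad(c, 10 - char_length(a) - char_length(b), '0'))
          ELSE concat(a, b, c)
      END AS x"""
).show()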

CodePudding user response:

Logically what you want is the following:

val ab = concat($"a", $"b")
val x = concat(ab, lpad($"c", lit(10) - length(ab), lit("0")))
//                            ^ Not possible        ^ Not possible

Unfortunately this isn't possible, because the Spark Scala lpad function has the signature lpad(str: Column, len: Int, pad: String): you can't pass Column objects as the len and pad parameters.

If you examine the underlying Catalyst StringLPad expression type, it doesn't require constants for the len and pad parameters. This means we can define our own version of lpad that accepts Column values for len and pad, so that the length and the padding string can vary per row.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.StringLPad

// lpad variant that accepts Columns for len and pad, evaluated per row
def lpad_var(str: Column, len: Column, pad: Column) =
  new Column(StringLPad(str.expr, len.expr, pad.expr))

val ab = concat($"a", $"b")
val x = concat(ab, lpad_var($"c", lit(10) - length(ab), lit("0")))

Here is an example output with some edge cases:

val df = Seq(
  ("a", "bb", "c"),
  ("aa", "bbb", "c"),
  ("", "", "")
).toDF("a", "b", "c")

df.select(x.as("x")).show()
// +----------+
// |         x|
// +----------+
// |abb000000c|
// |aabbb0000c|
// |0000000000|
// +----------+
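
One behavioral difference from the first answer: lpad truncates the input when it is already longer than the requested length, so here a c that overflows the remaining space is cut and x always comes out exactly 10 characters. If you would rather keep the full concatenation in that case, you can wrap x in the same when/otherwise guard; a sketch reusing x and df from above:

val abc = concat($"a", $"b", $"c")
// Fall back to the unpadded concatenation once it is already 10+ characters.
val xGuarded = when(length(abc) < 10, x).otherwise(abc)

df.select(xGuarded.as("x")).show()

Note also that StringLPad lives in Catalyst's internal expressions package, so its constructor may change between Spark versions.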