I am creating a new attribute x with concatenation of attributes a b c. If the total length of x is less than 10, then I want to prefix only attribute c with 0. How can I do this?
val x = when length(concat($"a", $"b", $"c")) < 10,
concat($"a", $"b", lpad($"c", 10, '0'))
.otherwise(concat($"a", $"b", $"c"))
The above wont work as column c will be prefixed with 0's until length 10, whereas I want the overall length after concat to be 10. Please suggest.
CodePudding user response:
You can use an sql expression for the lpad
:
val df =Seq(("aaa","bbb","cccc"),
("a","b","c"),
("a","b","1234567890123"),
("a","b","")).toDF("a","b","c")
df.withColumn("x", when(length(concat($"a",$"b",$"c")) < 10,
concat($"a", $"b", expr("lpad(c, 10 - char_length(a) - char_length(b), '0')")))
.otherwise(concat($"a",$"b",$"c")))
.show()
Output:
--- --- ------------- ---------------
| a| b| c| x|
--- --- ------------- ---------------
|aaa|bbb| cccc| aaabbbcccc|
| a| b| c| ab0000000c|
| a| b|1234567890123|ab1234567890123|
| a| b| | ab00000000|
--- --- ------------- ---------------
CodePudding user response:
Logically what you want is the following:
val ab = concat($"a", $"b")
val x = concat(ab, lpad($"c", lit(10) - length(ab), lit("0")))
// ^ Not possible ^ Not possible
Unfortunately this isn't possible because the Spark Scala lpad
function has the signature str: Column, len: Int, pad: String
and you can't provide Column
objects as the len
and pad
parameters.
If you examine the underlying Catalyst StringLPad
expression type, it doesn't require constants for the len
and pad
parameters. This means we can define our own version of lpad
that allows for Column values passed as len
and pad
so that the length and the length and padding string can be variable for each row.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.StringLPad
def lpad_var(str: Column, len: Column, pad: Column) =
new Column(StringLPad(str.expr, len.expr, pad.expr))
val ab = concat($"a", $"b")
val x = concat(ab, lpad_var($"c", lit(10) - length(ab), lit("0")))
Here is an example output with some edge cases:
val df = Seq(
("a", "bb", "c"),
("aa", "bbb", "c"),
("", "", "")
).toDF("a", "b", "c")
df.select(x.as("x")).show()
// ----------
// | x|
// ----------
// |abb000000c|
// |aabbb0000c|
// |0000000000|
// ----------