Decoding Base64 in Spark Scala

Time:12-15

I have created the following DataFrame:

val data = spark.sparkContext.parallelize(Seq(("SnVsZXMgTmV3b25l"), ("Jason Kidd"), ("TXIgUm9uYWxkIE0=")))
val df_data = data.toDF()
val decoded_got = df_data.withColumn("xxx", unbase64(col("value")).cast("String"))

And I get the following:

+----------------+------------+
|name            |xxx         |
+----------------+------------+
|SnVsZXMgTmV3b25l|Jules Newone|
|Jason Kidd      |%�(���   |
|TXIgUm9uYWxkIE0=|Mr Ronald M |
+----------------+------------+

What I want is to leave the values of the column name that are not Base64-encoded unchanged. For example, I would like to get the following DataFrame:

+----------------+------------+
|name            |xxx         |
+----------------+------------+
|SnVsZXMgTmV3b25l|Jules Newone|
|Jason Kidd      |Jason Kidd  |
|TXIgUm9uYWxkIE0=|Mr Ronald M |
+----------------+------------+

I am trying something like this, but it is not working for me:

val regex1 = """^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"""
val check = df_data.withColumn("xxx", when(regex1 matches col("value"), unbase64(col("value"))).otherwise(col("value")))

Is there a way in Spark Scala to check whether a value is Base64-encoded, or how else could I do this?

CodePudding user response:

To check whether a value is a valid Base64-encoded string, decode it and then encode it again: you should get the initial value back. If you don't, it is not a Base64 string:

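// Column functions used below.
import org.apache.spark.sql.functions.{base64, col, unbase64, when}

val decoded_got = df_data.withColumn(
  "xxx",
  when(
    // Keep the decoded value only if re-encoding reproduces the input exactly.
    base64(unbase64(col("value"))) === col("value"),
    unbase64(col("value")).cast("string")
  ).otherwise(col("value"))
)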

decoded_got.show
//+----------------+------------+
//|           value|         xxx|
//+----------------+------------+
//|SnVsZXMgTmV3b25l|Jules Newone|
//|      Jason Kidd|  Jason Kidd|
//|TXIgUm9uYWxkIE0=| Mr Ronald M|
//+----------------+------------+
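
The decode-then-re-encode idea can also be tried outside Spark. The following is only a minimal sketch in plain Scala, assuming the JDK's java.util.Base64 instead of Spark's base64/unbase64 functions, so its handling of malformed input may differ from Spark's. The object name Base64Check and its three helpers are hypothetical names made up for this illustration; the pattern is the one from the question, written with the + character in the alphabet.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64
import scala.util.Try

// Hypothetical helper object, for illustration only.
object Base64Check {
  // Standard Base64 alphabet with optional "=" padding.
  private val Base64Pattern =
    """^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"""

  // Regex check only: does the string have the shape of padded Base64?
  def looksLikeBase64(s: String): Boolean = s.matches(Base64Pattern)

  // Round-trip check: decode, re-encode, and compare with the input.
  // The JDK decoder throws on illegal characters, hence the Try.
  def roundTrips(s: String): Boolean =
    Try(Base64.getDecoder.decode(s))
      .map(bytes => Base64.getEncoder.encodeToString(bytes) == s)
      .getOrElse(false)

  // Decode when the round trip succeeds, otherwise return the input unchanged.
  def decodeOrKeep(s: String): String =
    if (roundTrips(s)) new String(Base64.getDecoder.decode(s), StandardCharsets.UTF_8)
    else s
}
```

With this sketch, Base64Check.decodeOrKeep("SnVsZXMgTmV3b25l") gives "Jules Newone", while Base64Check.decodeOrKeep("Jason Kidd") returns the input unchanged because the space is not part of the Base64 alphabet. Note that neither check can tell plain text from Base64 when the text happens to be valid Base64: a string such as "Jaso" passes both the regex and the round trip and would therefore be decoded. In Spark, the same pattern could probably be applied with col("value").rlike(...) in place of the comparison in the when condition, but that variant is not tested here.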