Home > other >  Spark-Scala: How to compare dates in strings of varying length and return minimum value?
Spark-Scala: How to compare dates in strings of varying length and return minimum value?

Time:10-02

I have dataset of the following type:

userid  column  
1004    value=bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018.   value=36ba7a7|2021-09-19 01:33:00,
1020    value=23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022.   value=3623fe40|2021-09-19 01:33:00,
1028    value=6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032    value=ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,

I have to parse the value part each of the strings and return the minimum date from them. The strings are of variable/dynamic lengths and there is no set limit to any value string.

I tried passing them in a function:

val strFun = (str: String) =>{
    if (str != null) {
      val str_split = str.replaceAll("""\|""",",").split(",")
      }
    }
    (str_split(0),str_split(1))
  };

But the constraint I'm stuck on is to determine the number of dates to be returned for comparison and to pick minimum date from them. How can this be done?

CodePudding user response:

First split the string. The result is an string array. Using transform from each element of the array the date string is extracted (using regexp_extract) and converted into a timestamp. Now we have an array of timestamps from which we take the minimal value.

import org.apache.spark.sql.functions._
df.withColumn("min_date", split('column, ","))
  .withColumn("min_date", array_min(transform('min_date,
      c => to_timestamp(regexp_extract(c, "\\|(.*)$", 1)))))
  .show()

Output:

 ------ -------------------- ------------------- 
|userid|              column|           min_date|
 ------ -------------------- ------------------- 
|  1004|value=bb5469c5|20...|2021-09-19 01:25:30|
| 1018.|value=36ba7a7|202...|2021-09-19 01:33:00|
|  1020|value=23fe40-4796...|2021-09-19 01:34:53|
| 1022.|value=3623fe40|20...|2021-09-19 01:33:00|
|  1028|value=6c77d26c-6f...|2021-09-19 01:50:50|
|  1032|value=ac55-4be82f...|2021-09-19 01:54:20|
 ------ -------------------- ------------------- 
  • Related