In Spark Scala, how to create a column with substring() using locate() as a parameter?-CodePudding

I have a dataset that is like the following:

val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")

I want to create a column that will be a string containing only the values after Id. The result would be something like:

val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")

For that, I am trying to use substring. For the the parameter of the starting position to extract the substring, I am trying to use locate. But it gives me an error saying that it should be an Int and not a column type.

What I am trying is like:

df
.withColumn("id_value", substring($"list", locate("id", $"list")   2, 2))

The error I get is:

error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Int
.withColumn("id_value", substring($"list", locate("id", $"list")   2, 2))
                                                                 ^

How can I fix this and continue using locate() as a parameter?

UPDATE Updating to give an example in which @wBob answer doesn't work for my real world data: my data is indeed a bit more complicated than the examples above.

It is something like this:

val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-")

The values are very long strings that don't have a specific pattern.

Somewhere in the string that is always a part written ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always after a ID .

I am not being able to use neither split nor regexp_extract to get something in this pattern.

CodePudding user response：

It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:

// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")

df
    .withColumn("t1", split($"list", "\\ ")(2))
    .withColumn("t2", regexp_extract($"list", "\\d ", 0))
    .withColumn("t3", regexp_extract($"list", "(id )(\\d )", 2))
    .withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
    .show()

You can use functions like split and regexp_extract with withColumn to create new columns based on existing values. split splits out the list into an array based on the delimiter you pass in. I have used space here, escaped with two slashes to split the array. The array is zero-based hence specifying 2 gets the third item in the array. regexp_extract uses regular expressions to extract from strings. here I've used \\d which represents digits and which matches the digit 1 or many times. The third column, t3, again uses regexp_extract with a similar RegEx expression, but using brackets to group up sections and 2 to get the second group from the regex, ie the (\\d ). NB I'm using additional slashes in the regex to escape the slashes used in the \d.

My results:

If your real data is more complicated please post a few simple examples where this code does not work and explain why.