UDFs Use Array to Split String in Column


I have a df that has two columns. One column is a string and the other is an array of integers.

root
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: integer (containsNull = true)

The dataframe looks like:

+--------------------+------------+
|                col1|        col2|
+--------------------+------------+
|Barkley likes peo...|[22, 22, 25]|
+--------------------+------------+

The array actually tells me where I need to break apart the sentences in col1.

If the value in col1 is "Barkley likes people. Barkley likes treats. Barkley likes everything.", the array tells me that characters 0-22 are the first sentence, characters 22 to 44 (22+22) are the second sentence, and the last sentence runs from 44 (22+22) to 69 (44+25).
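To illustrate the splits in plain Python (just to show the offsets; this is not the Spark solution I'm after):

text = "Barkley likes people. Barkley likes treats. Barkley likes everything."

text[0:22]    # 'Barkley likes people. '
text[22:44]   # 'Barkley likes treats. '
text[44:69]   # 'Barkley likes everything.'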

I need to avoid sending anything to the driver node and maintain parallelism. My question, then, is: how can I create a UDF that leverages the integers in my array to split the sentences in col1? The output could use withColumn and return three new columns, or a map with each sentence. Can I do this without a for loop, list comprehension, collect() or select()?

CodePudding user response:

For Spark version >= 2.4, we can use higher-order functions to work with arrays, which handles this problem without a UDF. Let's say df is the dataframe.

df = spark.createDataFrame([
           ("Barkley likes people. Barkley likes treats. Barkley likes everything.",[22, 22, 25]),
           ("A sentence. Another sentence.",[13, 18]),
           ("One sheep. Two sheep. Three sheep. Four sheep.",[11, 12, 13, 12])], 
           "col1:string, col2:array<int>")
df.show()
# +--------------------+----------------+
# |                col1|            col2|
# +--------------------+----------------+
# |Barkley likes peo...|    [22, 22, 25]|
# |A sentence. Anoth...|        [13, 18]|
# |One sheep. Two sh...|[11, 12, 13, 12]|
# +--------------------+----------------+

To slice the sentences from col1, the substring function will be used; it takes a start position and a length as arguments. col2 holds the length of each sentence in the string. The start position of each sentence is the cumulative sum of the lengths in col2 that come before it, as hinted in the question. To compute that, use the higher-order functions transform and aggregate. After that, extract each sentence and use map_from_entries to build a map from sentence index to sentence. This is an example of doing so.

import pyspark.sql.functions as F

df = (df
      # start[i] = sum of the first i lengths in col2, i.e. the cumulative offsets
      .withColumn("start", F.expr("transform(transform(col2, (v1,i) -> slice(col2, 1, i)), v2 -> aggregate(v2, 0, (a,b) -> a + b))"))
      # cut out each sentence and pair it with its 1-based index
      .withColumn("sentences", F.expr("transform(col2, (v, i) -> struct(i+1 as index, substring(col1, start[i], col2[i]) as sentence))"))
      # turn the array of (index, sentence) structs into a map
      .selectExpr("col1", "map_from_entries(sentences) as sentences")
      )
df.show(truncate=False)
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
# |col1                                                                 |sentences                                                                                 |
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
# |Barkley likes people. Barkley likes treats. Barkley likes everything.|[1 -> Barkley likes people. , 2 ->  Barkley likes treats., 3 ->  Barkley likes everything]|
# |A sentence. Another sentence.                                        |[1 -> A sentence. A, 2 -> Another sentence.]                                              |
# |One sheep. Two sheep. Three sheep. Four sheep.                       |[1 -> One sheep. , 2 ->  Two sheep. , 3 -> Three sheep. , 4 -> Four sheep.]               |
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
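If separate columns are preferred to a map (the question mentions withColumn and three new columns), the map entries can be pulled out by key. This is a small sketch assuming a known maximum number of sentences (here 3, with a hypothetical df_cols name); rows with fewer sentences get null for the missing keys.

df_cols = (df
           # look up each sentence in the map by its 1-based index
           .withColumn("sentence1", F.col("sentences")[1])
           .withColumn("sentence2", F.col("sentences")[2])
           .withColumn("sentence3", F.col("sentences")[3]))
df_cols.show(truncate=False)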