How to create a new array of substrings from a string array column in a Spark DataFrame

Time:03-24

I have a Spark DataFrame. One of its columns is of array type and holds text strings of varying lengths. I am looking for a way to add a new column containing an array of the unique leftmost 8 characters of those strings.

df.printSchema()

root
(...)
 |-- arr_agent: array (nullable = true)
 |    |-- element: string (containsNull = true)

Example data from column "arr_agent":

["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF"]

What I need to have in the new column:

["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF"]

I already tried to define a UDF that does this for me:

from pyspark.sql import functions as F
from pyspark.sql import types as T

def make_list_of_unique_prefixes(text_array, prefix_length=8):
    out_arr = set(t[0:prefix_length] for t in text_array)
    return(out_arr)

make_list_of_unique_prefixes_udf = F.udf(lambda x,y=8: make_list_of_unique_prefixes(x,y), T.ArrayType(T.StringType()))

But then calling:

df.withColumn("arr_prefix8s", F.collect_set(make_list_of_unique_prefixes_udf(F.col("arr_agent"))))

throws the error AnalysisException: grouping expressions sequence is empty.

Any tips would be appreciated. Thanks.

CodePudding user response:

You can solve this with the higher-order functions available since Spark 2.4: use transform with substring to take the prefix of each array element, then apply array_distinct to drop the duplicates:

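from pyspark.sql import functions as F

n = 8  # prefix length
# transform() applies substring to every element; array_distinct() removes duplicates.
# Spark SQL substring positions are 1-based, so start at 1.
out = df.withColumn("New", F.expr(f"array_distinct(transform(arr_agent, x -> substring(x, 1, {n})))"))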

out.show(truncate=False)

+-----------------------------------------------------+----------------------------------------+
|arr_agent                                            |New                                     |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A]                              |[NRCANL2A]                              |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A]                    |[UTRONL2U, BKRBNL2A]                    |
|[NRCANL2A]                                           |[NRCANL2A]                              |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A]                    |[UTRONL2U, BKRBNL2A]                    |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF]                 |[MQBFDEFF]                              |
+-----------------------------------------------------+----------------------------------------+
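
As for why the original attempt failed: F.collect_set is an aggregate function, so Spark expects it inside groupBy(...).agg(...) or over a window rather than in a plain withColumn, which is most likely what produces the "grouping expressions sequence is empty" error. The UDF also returns a Python set, whereas an ArrayType column is normally filled from a list. If a UDF is still preferred (for example on a Spark version older than 2.4), a minimal sketch could look like the one below. The names unique_prefixes and add_prefix_column are made up for illustration, and skipping null elements is an assumption about the desired behaviour; the built-in functions shown above will usually be faster than a Python UDF.

```python
def unique_prefixes(text_array, prefix_length=8):
    """Return the distinct prefixes of the given strings, in first-seen order."""
    if text_array is None:  # the column is nullable
        return None
    # Assumption: null elements are skipped rather than kept.
    prefixes = (t[:prefix_length] for t in text_array if t is not None)
    # dict.fromkeys de-duplicates while preserving insertion order;
    # the result is a list (not a set) so that it matches ArrayType.
    return list(dict.fromkeys(prefixes))


def add_prefix_column(df, src="arr_agent", dst="arr_prefix8s", prefix_length=8):
    """Hypothetical helper: add column `dst` with the unique prefixes of `src`."""
    # Imported here so unique_prefixes can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    prefix_udf = F.udf(
        lambda arr: unique_prefixes(arr, prefix_length),
        T.ArrayType(T.StringType()),
    )
    # No collect_set here: the UDF already produces one array per row.
    return df.withColumn(dst, prefix_udf(F.col(src)))
```

With the DataFrame from the question, this would be called as out = add_prefix_column(df), followed by out.show(truncate=False).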