How to split a string into an array of characters in Spark?


How to split a string column into an array of characters?

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# |   Vilnius|
# |      Riga|
# |   Tallinn|
# |  New York|
# +----------+

Desired output:

# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+

CodePudding user response:

split can be used with an empty string '' as the separator. However, it returns an empty string as the last element of the resulting array, so slice is needed to drop that last element.

split = "split(col_cities, '')"
split = F.expr(f'slice({split}, 1, size({split})-1)')

df.withColumn('split', split).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
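
The same result can also be built with DataFrame API functions instead of a SQL expression string. This is a minimal sketch assuming Spark 3.1+, where F.slice accepts a Column as the length argument:

from pyspark.sql import functions as F

chars = F.split('col_cities', '')                  # array of characters plus a trailing empty string
no_tail = F.slice(chars, 1, F.size(chars) - 1)     # keep everything except the last (empty) element

df.withColumn('split', no_tail).show(truncate=0)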

Sometimes it is more convenient to wrap this in a function:

def split_to_chars(c):
    """Split string column `c` into an array of characters, dropping the trailing empty string."""
    split = f"split({c}, '')"
    return F.expr(f'slice({split}, 1, size({split})-1)')

df.withColumn('split', split_to_chars('col_cities')).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
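
Since the helper returns a Column expression, it can also be used inside a select; a small usage sketch:

df.select('col_cities', split_to_chars('col_cities').alias('split')).show(truncate=0)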

CodePudding user response:

You can use split with a regex pattern that uses a negative lookahead. The pattern '(?!$)' matches at every position that is not followed by the end of the string, so each character is split out and no trailing empty element is produced:

df.withColumn('split', F.split('col_cities', '(?!$)')).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
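
Whichever approach is used, the resulting column is an array of strings, since Spark has no dedicated character type. A quick check (a sketch for a Spark 3.x setup; the nullability flags may differ depending on how the DataFrame was created):

df.withColumn('split', F.split('col_cities', '(?!$)')).printSchema()
# root
#  |-- col_cities: string (nullable = true)
#  |-- split: array (nullable = true)
#  |    |-- element: string (containsNull = true)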