How to split a string column into an array of characters?
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# |   Vilnius|
# |      Riga|
# |   Tallinn|
# |  New York|
# +----------+
Desired output:
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
CodePudding user response:
split can be used with an empty string '' as the separator. However, the resulting array then ends with an empty string, so slice is needed to remove that last element.
split = "split(col_cities, '')"
split = F.expr(f'slice({split}, 1, size({split})-1)')
df.withColumn('split', split).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
Sometimes it may be better to have a function:
def split_to_chars(c):
    # c is a column name; slice drops the trailing empty string left by split(..., '')
    split = f"split({c}, '')"
    return F.expr(f'slice({split}, 1, size({split})-1)')
df.withColumn('split', split_to_chars('col_cities')).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
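The slice approach removes the last element by position, so it relies on split(..., '') actually producing a trailing empty string. That behavior may differ between Spark versions, and where no empty string is produced the slice would drop a real character. A minimal sketch of an alternative that removes empty strings by value instead, assuming Spark 2.4+ for the SQL filter higher-order function. The helper name chars_sql is hypothetical and not part of PySpark:

```python
def chars_sql(col_name):
    """Build a Spark SQL expression that splits a string column into characters.

    Hypothetical helper: empty strings are filtered out by value, so the
    result does not depend on whether split() leaves a trailing ''.
    """
    return f"filter(split(`{col_name}`, ''), x -> x != '')"


if __name__ == "__main__":
    # Assumes a local PySpark installation; the session setup is illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("Riga",), ("New York",)], ["col_cities"])
    df.withColumn("split", F.expr(chars_sql("col_cities"))).show(truncate=False)
```

Because a single character is never an empty string, the filter only ever removes the artifact element; the space in 'New York' is kept.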
CodePudding user response:
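You can use split with a regex pattern containing a negative lookahead. The pattern (?!$) matches the empty position before every character but not the end of the string, so no trailing empty element is produced:
df.withColumn('split', F.split('col_cities', '(?!$)')).show(truncate=0)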
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
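To see why the lookahead pattern avoids the trailing empty element, it can help to list where each pattern matches. The sketch below uses Python's re module purely as an illustration. Spark's split uses Java regular expressions, and the assumption here is only that these two simple patterns match at the same positions in both engines. The helper name match_positions is hypothetical:

```python
import re


def match_positions(pattern, text):
    """Return the start index of every (possibly empty) match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


word = "Riga"

# The empty pattern matches at every position, including the end (index 4),
# which is where the trailing empty element comes from.
print(match_positions(r"", word))

# (?!$) refuses to match where the end of the string follows,
# so the final position is skipped.
print(match_positions(r"(?!$)", word))
```

Python's own re.split treats the empty match at position 0 differently from Java, so this is a way to inspect match positions rather than a drop-in replacement for the Spark call.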