I have a column with arrays of strings, e.g. like this:
["test.a" "random.ac"]
["test.41" "random.23" "test.123"]
I want to get only the text before the ".". I did it only for the fist element of the array. How do I do it to all elements? Preferably without UDFs.
df = df.withColumn("address", substring_index(col("all_addresses").getItem(0), ".", 1)))
CodePudding user response:
Use regexp_extract. extra alphanumerics
matches if only followed by special character .
--- ------------------------------
|id |text |
--- ------------------------------
|1 |[test.a, random.ac] |
|2 |[test.41, random.23, test.123]|
--- ------------------------------
df.withColumn('new', expr("transform(text,x->regexp_extract(x,'[a-z0-9] (?=\.)',0))")).show()
--- -------------------- --------------------
| id| text| new|
--- -------------------- --------------------
| 1| [test.a, random.ac]| [test, random]|
| 2|[test.41, random....|[test, random, test]|
--- -------------------- --------------------
CodePudding user response:
I would use a similar idea as @wwnde - transform
function. transform
takes an array, transforms every its element according to the provided function and results in the array of the same size, but with changed elements. Exactly what you need.
However, having the same original idea, I would probably implement it differently.
2 options:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(["test.a", "random.ac"],),
(["test.41", "random.23", "test.123"],)],
['c1']
)
df = df.withColumn('c2', F.transform('c1', lambda x: F.element_at(F.split(x, '\.'), 1)))
df = df.withColumn('c3', F.transform('c1', lambda x: F.regexp_extract(x, r'(. )\.', 1)))
df.show()
# -------------------- -------------------- --------------------
# | c1| c2| c3|
# -------------------- -------------------- --------------------
# | [test.a, random.ac]| [test, random]| [test, random]|
# |[test.41, random....|[test, random, test]|[test, random, test]|
# -------------------- -------------------- --------------------