I want to create a new column from a string column that uses " " as a separator, but skip the split when a digit group follows a word, and finally remove the ";" at the end if it exists, using Python/PySpark:
Inputs:
"511 520 NA 611;"
"322 GA 620"
"3 321;"
"334344"
Expected output:
Column | new column
"511 520 NA 611;" | [511,520,NA 611]
"322 GA 620" | [322,GA 620]
"3 321; " | [3,321]
"334 344" | [334,344]
I tried:
data = data.withColumn(
    "newcolumn",
    split(col("column"), r"\s"))
but the ";" stays attached to the last element (and I get an empty string at the end of the array when the value ends with a space), as shown here, and I want to remove that if it exists:
Column | new column
"511 520 NA 611;" | [511,520,NA,611;]
"322 GA 620" | [322,GA,620]
"3 321;" | [3,321;]
"334 344" | [334,344]
CodePudding user response:
You can use regexp_replace to remove the ";" at the end of the string first, and then apply split. The regular expression ";$" matches a ";" only when it occurs at the end of the string.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, regexp_replace

spark = SparkSession.builder.getOrCreate()

data = [
    ("511 520 NA 611;",),
    ("322 GA 620",),
    ("3 321;",),
    ("334 344",)
]
df = spark.createDataFrame(data, ['column'])

# Remove a trailing ";" first, then split on whitespace.
df = df.withColumn("newcolumn", split(regexp_replace(col("column"), ';$', ''), "\\s"))
df.show(truncate=False)
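For reference, the resulting arrays should look roughly like this (the trailing ";" is gone, but note that "NA" and "611" are still separate elements, since this approach splits on every whitespace):
# "511 520 NA 611;" -> [511, 520, NA, 611]
# "322 GA 620"      -> [322, GA, 620]
# "3 321;"          -> [3, 321]
# "334 344"         -> [334, 344]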
CodePudding user response:
As mentioned in the comments, you can use regexp_extract_all together with a suitable regexp, as shown below. The pattern matches either a standalone number or a word followed by whitespace and a number, so tokens like "NA 611" stay together:
from pyspark.sql import functions as F

data = [
    ["511 520 NA 611;"],
    ["322 GA 620"],
    ["3 321;"],
    ["334344"]
]
df = spark.createDataFrame(data, ["value"])
df.withColumn("extracted_value", F.expr("regexp_extract_all(value, '(\\\d+)|(\\\w+\\\s\\\d+)', 0)")).show()
# +---------------+------------------+
# |          value|   extracted_value|
# +---------------+------------------+
# |511 520 NA 611;|[511, 520, NA 611]|
# |     322 GA 620|     [322, GA 620]|
# |         3 321;|          [3, 321]|
# |         334344|          [334344]|
# +---------------+------------------+
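The pattern has to survive two layers of escaping (the Python string and the Spark SQL string literal); what actually reaches the regex engine is (\d+)|(\w+\s\d+). Assuming Spark 3.1+ (where regexp_extract_all is available as a SQL function), the same expression can be written a bit more readably with a raw Python string, so each regex backslash only needs to be doubled once for the SQL literal:
df.withColumn(
    "extracted_value",
    F.expr(r"regexp_extract_all(value, '(\\d+)|(\\w+\\s\\d+)', 0)")
).show()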