I want to create a new column from a string column that uses ";" as a separator, and drop the trailing ";" if there is one, using Python/PySpark:
Inputs :
"511;520;611;"
"322;620"
"3;321;"
"334;344"
Expected output:
Column | new column
"511;520;611;" | [511,520,611]
"322;620" | [322,620]
"3;321;" | [3,321]
"334;344" | [334,344]
What I tried:
from pyspark.sql.functions import col, split
data = data.withColumn(
    "newcolumn",
    split(col("column"), ";"))
But I get an empty string at the end of the array, as shown here, and I want to remove it if it exists:
Column | new column
"511;520;611;" | [511,520,611,empty string]
"322;620" | [322,620]
"3;321;" | [3,321,empty string]
"334;344" | [334,344]
CodePudding user response:
With a pandas DataFrame, use strip(), which removes ; from the start and end of the string:
df.column.str.strip(";").str.split(";")
Or use apply with a lambda:
df.column.str.split(';').apply(lambda x: [e for e in x if e!=""])
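The two approaches above are not quite equivalent, which a plain-Python sketch on the sample inputs can show. The helper names `strip_then_split` and `split_then_filter` are hypothetical, introduced only for illustration; they mirror what the pandas one-liners do per value.

```python
def strip_then_split(value, sep=";"):
    # Remove separators from both ends first, then split.
    return value.strip(sep).split(sep)


def split_then_filter(value, sep=";"):
    # Split first, then drop every empty piece, wherever it occurs.
    return [part for part in value.split(sep) if part != ""]


samples = ["511;520;611;", "322;620", "3;321;", "334;344"]
stripped = [strip_then_split(s) for s in samples]
filtered = [split_then_filter(s) for s in samples]

# Edge case: an empty piece in the middle of the string.
middle_stripped = strip_then_split("1;;2;")
middle_filtered = split_then_filter("1;;2;")
```

On the sample inputs both give the same lists. They differ when an empty piece sits in the middle: stripping keeps it, filtering drops it. Which behaviour is wanted depends on whether an empty entry between two separators is meaningful in the data.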
CodePudding user response:
For Spark version >= 2.4, use the filter function with a != '' condition to filter out empty strings in the array:
from pyspark.sql.functions import expr
data = data.withColumn("newcolumn", expr("filter(split(column, ';'), x -> x != '')"))