I have a DataFrame:
columns = ['id', 'text']
vals = [
(1, 'I am good or'),
(2, 'You are okey'),
(3, 'She is fine in')
]
df = spark.createDataFrame(vals, columns)
+---+--------------+
| id|          text|
+---+--------------+
|  1|  I am good or|
|  2|  You are okey|
|  3|She is fine in|
+---+--------------+
I want to remove the last word only if it is shorter than 3 characters. I tried the code below, but it drops every short token, and I don't understand how to apply the length check to the last token only. I also don't want the length condition to be fixed.
df.withColumn('split_text', F.split(F.col("text")," "))\
.withColumn("split_text", F.expr("filter(split_text, x -> not(length(x) < 3))"))
I expect this result:
+---+------------+
| id|        text|
+---+------------+
|  1|  I am good |
|  2|You are okey|
|  3|She is fine |
+---+------------+
CodePudding user response:
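from pyspark.sql import functions as f

# Split the text into words, then drop the last word only if it is shorter than 3 characters.
df.withColumn(
    'split_text',
    f.split(f.col("text"), " "))\
.withColumn(
    "split_text",
    f.expr("if(length(split_text[size(split_text) - 1]) < 3, slice(split_text, 1, size(split_text) - 1), split_text)")).show()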
+---+--------------+----------------+
| id|          text|      split_text|
+---+--------------+----------------+
|  1|  I am good or|   [I, am, good]|
|  2|  You are okey|[You, are, okey]|
|  3|She is fine in| [She, is, fine]|
+---+--------------+----------------+
The expr explained in detail:
if (
  length(split_text[size(split_text) - 1]) < 3,  # condition: is the last item in the array (0-based index) shorter than 3 characters?
  slice(                    # value if the condition is true
    split_text,             # array to slice
    1,                      # int - start position of the slice to keep (slice is 1-based)
    size(split_text) - 1),  # int - number of elements to keep (a length, not an end index)
  split_text                # value if the condition is false
)
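The question asks for a string column and a length condition that is not fixed, while the snippet above returns an array and hard-codes 3. A minimal sketch of one way to cover both, assuming Spark 2.4+ (for element_at, slice and array_join): build the same expression with the threshold as a parameter and join the words back with array_join. The helper names drop_short_last_word_expr and with_trimmed_text are hypothetical, not part of PySpark.

```python
def drop_short_last_word_expr(column: str, min_len: int = 3) -> str:
    """Build a Spark SQL expression string that drops the last word of
    `column` when that word is shorter than `min_len` characters.

    Hypothetical helper for illustration; `column` is assumed to be a
    plain column name that is safe to embed in SQL.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    words = f"split({column}, ' ')"
    return (
        "array_join("
        f"if(length(element_at({words}, -1)) < {int(min_len)}, "  # check the last word only
        f"slice({words}, 1, size({words}) - 1), "                 # keep all but the last word
        f"{words}), ' ')"                                         # otherwise keep every word
    )


def with_trimmed_text(df, column: str = "text", min_len: int = 3):
    """Return `df` with `column` replaced by the trimmed string."""
    # Imported lazily so the expression builder above can be used without Spark.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.expr(drop_short_last_word_expr(column, min_len)))
```

With the DataFrame from the question, with_trimmed_text(df).show() should turn rows 1 and 3 into "I am good" and "She is fine" (without the trailing space shown in the expected output), and with_trimmed_text(df, min_len=5) would also drop "okey" and "fine".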