Remove the last element of an array if its length is less than a given number (PySpark DataFrame)

Time:06-20

I have a DataFrame:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

columns = ['id', 'text']
vals = [
    (1, 'I am good or'),
    (2, 'You are okey'),
    (3, 'She is fine in'),
]

df = spark.createDataFrame(vals, columns)
+---+--------------+
| id|          text|
+---+--------------+
|  1|  I am good or|
|  2|  You are okey|
|  3|She is fine in|
+---+--------------+

I want to remove the last word only if it is shorter than 3 characters. I tried the following, but I don't understand how to apply the length check to the last token only; my attempt filters every token. I also want the length threshold to be configurable rather than fixed.

df.withColumn('split_text', F.split(F.col("text")," "))\
  .withColumn("split_text", F.expr("filter(split_text, x -> not(length(x) < 3))"))

I expect this result:

+---+------------+
| id|        text|
+---+------------+
|  1|  I am good |
|  2|You are okey|
|  3|She is fine |
+---+------------+

CodePudding user response:

df.withColumn(
    "split_text",
    F.split(F.col("text"), " ")
).withColumn(
    "split_text",
    # Drop the last token only when it is shorter than 3 characters.
    F.expr(
        "if(length(split_text[size(split_text) - 1]) < 3, "
        "slice(split_text, 1, size(split_text) - 1), "
        "split_text)"
    )
).show()

+---+--------------+----------------+
| id|          text|      split_text|
+---+--------------+----------------+
|  1|  I am good or|   [I, am, good]|
|  2|  You are okey|[You, are, okey]|
|  3|She is fine in| [She, is, fine]|
+---+--------------+----------------+
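
As a quick sanity check outside Spark, the same rule can be sketched in plain Python. The function name `drop_short_last_word` and the `min_len` parameter are made up for illustration; the function only mirrors the logic of the expression above and is not a replacement for the DataFrame code.

```python
def drop_short_last_word(text: str, min_len: int = 3) -> list:
    """Split on single spaces and drop the last token if it is shorter than min_len.

    Hypothetical helper for illustration; not part of the original answer.
    """
    words = text.split(" ")
    # Only the last token is checked; earlier short tokens such as "I" are kept.
    if len(words[-1]) < min_len:
        return words[:-1]
    return words


for row in ["I am good or", "You are okey", "She is fine in"]:
    print(drop_short_last_word(row))
```

Here `words[-1]` plays the role of `split_text[size(split_text) - 1]`, and `words[:-1]` plays the role of the `slice(...)` call.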

The expression explained in detail:
if(
  length(split_text[size(split_text) - 1]) < 3,  # condition: is the last item (0-based index size - 1) shorter than 3 characters?
  slice(                    # value if the condition is true
    split_text,             # array to slice
    1,                      # start position (slice is 1-based)
    size(split_text) - 1    # number of elements to keep: everything except the last
  ),
  split_text                # value if the condition is false: the array unchanged
)
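
The question asks for a threshold that is not fixed and expects a string column back rather than an array. One possible way to get there, as a sketch only, is to build the expression text from a parameter and wrap it in `array_join`. The helper `drop_short_last_word_expr` and its arguments are hypothetical names; `split`, `element_at`, `slice`, `size` and `array_join` are Spark SQL built-ins, and Spark 2.4 or later is assumed.

```python
def drop_short_last_word_expr(col: str = "text", min_len: int = 3) -> str:
    """Build a Spark SQL expression string (hypothetical helper, not from the answer).

    The expression splits `col` on spaces, drops the last token if it is
    shorter than `min_len`, and joins the remaining tokens back with spaces.
    """
    words = f"split({col}, ' ')"
    return (
        "array_join("
        # element_at(..., -1) reads the last element of the array.
        f"if(length(element_at({words}, -1)) < {int(min_len)}, "
        # slice(array, start, length): keep everything except the last element.
        f"slice({words}, 1, size({words}) - 1), "
        f"{words}), ' ')"
    )


print(drop_short_last_word_expr("text", 3))
```

With the DataFrame from the question, the string would be passed to `F.expr`, for example `df.withColumn("text", F.expr(drop_short_last_word_expr("text", 3)))`. Note that `array_join` does not leave the trailing space shown in the expected output of the question.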