Home > OS >  Differences of TRIM and TRIM BOTH in Spark
Differences of TRIM and TRIM BOTH in Spark

Time:04-01

I have created a Spark SQL temp view named old_ticket_df. In the temp view old_ticket_df I have a column "approver" which appears to have whitespace. I have tried to remove the whitespace using trim(), but it seems to have no effect. If I try instead to remove the whitespace using trim(BOTH '\t' from approver) it appears to remove the whitespace.

I have created a Spark dataframe called compare_df from the compare_query below, where I have unioned two examples together applying the two different versions of the trim function to the same field and value. The output from the code is below.

Can someone please tell me what the difference is between TRIM(approver) and TRIM(BOTH '\t' FROM approver) in Spark SQL?

Code:

compare_query="""
select * from (
select 'ver1' as data_source,  TRIM(BOTH '\t' FROM approver) as  approver
from old_ticket_df
where ticketid='xxxxx'
and ticketlineid='yyyyyyy'
and storeid='00'
and dateclosed='2020-06-06'
union
select 'ver2' as data_source, TRIM(approver) as approver
from old_ticket_df
where ticketid='xxxxx'
and ticketlineid='yyyyyyy'
and storeid='00'
and dateclosed='2020-06-06'
)
"""
compare_df=spark.sql(compare_query)
compare_df.show(truncate=False)

Output:

 ----------- ------------------------ 
|data_source|approver                |
 ----------- ------------------------ 
|ver2       |[email protected]    |
|ver1       |[email protected] |
 ----------- ------------------------ 

CodePudding user response:

trim(str) removes only space characters.
trim(BOTH trimStr FROM str) removes characters which you specify in trimStr.

Spark's trim documentation

Example:

df = spark.createDataFrame([(' SPACES ', ), ('\tTABS\t', )], ['str'])
df.createOrReplaceTempView("sql_df")
spark.sql("""
SELECT str,
       TRIM(str) as trim,
       TRIM(BOTH '\t' FROM str) as trim_both_tab,
       TRIM(BOTH ' \t' FROM str) as trim_both_tab_space
FROM sql_df
""").show()
# -------- -------- ------------- ------------------- 
#|     str|    trim|trim_both_tab|trim_both_tab_space|
# -------- -------- ------------- ------------------- 
#| SPACES |  SPACES|      SPACES |             SPACES|
#|\tTABS\t|\tTABS\t|         TABS|               TABS|
# -------- -------- ------------- ------------------- 
  • Column str is the original with value " SPACES " (having surrounding space characters) and value "\tTABS\t" (having surrounding tab characters).
  • Column trim shows the result of trim(str). Space characters were trimmed, but tab characters remain.
  • Column trim_both_tab shows the result of TRIM(BOTH '\t' FROM str). Tab characters were trimmed, but space characters remain.
  • Column trim_both_tab_space shows the result of TRIM(BOTH ' \t' FROM str). Both space and tab characters were removed, because they both were provided.

In your case, TRIM(approver) only removed spaces, so in ver2 line you still have tabs remaining. While TRIM(BOTH '\t' FROM approver) only removed tabs leaving spaces untouched.

  • Related