I have created a Spark SQL temp view named old_ticket_df
. In the temp view old_ticket_df
I have a column "approver" which appears to have whitespace. I have tried to remove the whitespace using trim()
, but it seems to have no effect. If I try instead to remove the whitespace using trim(BOTH '\t' from approver)
it appears to remove the whitespace.
I have created a Spark dataframe called compare_df
from the compare_query
below, where I have unioned two examples together applying the two different versions of the trim
function to the same field and value. The output from the code is below.
Can someone please tell me what the difference is between TRIM(approver)
and TRIM(BOTH '\t' FROM approver)
in Spark SQL?
Code:
compare_query="""
select * from (
select 'ver1' as data_source, TRIM(BOTH '\t' FROM approver) as approver
from old_ticket_df
where ticketid='xxxxx'
and ticketlineid='yyyyyyy'
and storeid='00'
and dateclosed='2020-06-06'
union
select 'ver2' as data_source, TRIM(approver) as approver
from old_ticket_df
where ticketid='xxxxx'
and ticketlineid='yyyyyyy'
and storeid='00'
and dateclosed='2020-06-06'
)
"""
compare_df=spark.sql(compare_query)
compare_df.show(truncate=False)
Output:
----------- ------------------------
|data_source|approver |
----------- ------------------------
|ver2 |[email protected] |
|ver1 |[email protected] |
----------- ------------------------
CodePudding user response:
trim(str)
removes only space characters.
trim(BOTH trimStr FROM str)
removes characters which you specify in trimStr
.
Example:
df = spark.createDataFrame([(' SPACES ', ), ('\tTABS\t', )], ['str'])
df.createOrReplaceTempView("sql_df")
spark.sql("""
SELECT str,
TRIM(str) as trim,
TRIM(BOTH '\t' FROM str) as trim_both_tab,
TRIM(BOTH ' \t' FROM str) as trim_both_tab_space
FROM sql_df
""").show()
# -------- -------- ------------- -------------------
#| str| trim|trim_both_tab|trim_both_tab_space|
# -------- -------- ------------- -------------------
#| SPACES | SPACES| SPACES | SPACES|
#|\tTABS\t|\tTABS\t| TABS| TABS|
# -------- -------- ------------- -------------------
- Column
str
is the original with value " SPACES " (having surrounding space characters) and value "\tTABS\t" (having surrounding tab characters). - Column
trim
shows the result oftrim(str)
. Space characters were trimmed, but tab characters remain. - Column
trim_both_tab
shows the result ofTRIM(BOTH '\t' FROM str)
. Tab characters were trimmed, but space characters remain. - Column
trim_both_tab_space
shows the result ofTRIM(BOTH ' \t' FROM str)
. Both space and tab characters were removed, because they both were provided.
In your case, TRIM(approver)
only removed spaces, so in ver2
line you still have tabs remaining. While TRIM(BOTH '\t' FROM approver)
only removed tabs leaving spaces untouched.