Home > Mobile >  How to truncate a table in PySpark?
How to truncate a table in PySpark?

Time:07-05

In one of my projects, I need to check if an input dataframe is empty or not. If it is not empty, I need to do a bunch of operations and load some results into a table and overwrite the old data there.

On the other hand, if the input dataframe is empty, I do nothing and simply need to truncate the old data in the table. I know how to insert data in with overwrite but don't know how to truncate table only. I searched existing questions/answers and no clear answer found.

driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
stage_url = 'jdbc:sqlserver://server_name\DEV:51433;databaseName=project_stage;user=xxxxx;password=xxxxxxx'

if input_df.count()>0:
    # Do something here to generate result_df

    print(" write to table ")
    write_dbtable = 'Project_Stage.StageBase.result_table'
    write_df = result_df
    write_df.write.format('jdbc').option('url', stage_url).option('dbtable', write_dbtable). \
        option('truncate', 'true').mode('overwrite').option('driver',driver).save()
else:
    print('no account to process!')
    query = """TRUNCATE TABLE Project_Stage.StageBase.result_table"""

    ### Not sure how to run the query

CodePudding user response:

Truncating is probably easiest done like this:

write_df = write_df.limit(0)

Also, for better performance, instead of input_df.count() > 0 you should use

  • Spark 3.2 and below: len(input_df.head(1)) > 0

  • Spark 3.3 : ~df.isEmpty()

  • Related