Alter multiple column comments simultaneously in spark/delta lake


Short version: I need a faster/better way to update many column comments at once in Spark/Databricks. I have a PySpark notebook that can do this sequentially across many tables, but if I call it from multiple tasks they spend so long waiting on a Hive connection that I get timeout failures.

Command used: ALTER TABLE my_db_name.my_table_name CHANGE my_column COMMENT "new comment" (docs)

Long version: I have a data dictionary notebook where I maintain column descriptions that are reused across multiple tables. If I run the notebook directly, it successfully populates all of my database's table and column comments by issuing the above command sequentially for every column across all tables (plus the corresponding table description command once).
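Roughly, the loop in that notebook looks like the sketch below (the comment dictionary, table list, and column names are illustrative, and spark is the notebook's SparkSession):

    # Illustrative sketch of the sequential data dictionary loop
    column_comments = {
        "customer_id": "Unique customer identifier",
        "order_ts": "Timestamp the order was placed",
    }
    tables = ["my_db_name.orders", "my_db_name.customers"]

    for table in tables:
        existing_cols = {f.name for f in spark.table(table).schema.fields}
        for col, comment in column_comments.items():
            if col in existing_cols:
                # One ALTER statement per column, issued one after another from the driver
                spark.sql(f'ALTER TABLE {table} CHANGE {col} COMMENT "{comment}"')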

I'm trying to move this to a per-table call. In the Databricks tasks that populate the tables I have a check to see whether the output table exists. If not, it's created, and at the end I call the dictionary notebook (using dbutils.notebook.run("Data Dictionary Creation", 600, {"db": output_db, "update_table": output_table})) to populate the comments for that particular table. If this happens simultaneously for multiple tables, however, the notebook calls now time out, as most of the tasks spend a long time waiting for a client connection to Hive. This is true even though there is only one call of the notebook per table.

Solution Attempts:

  1. I tried many variations of the above command to update all column comments in one call per table, but it's either impossible or my syntax is wrong.
  2. It's unclear to me how to avoid the timeout issues (I've doubled the timeout to 10 minutes and it still fails, while the original notebook takes much less time than that to run across all tables!). I need to wait for completion before continuing to the next task (or I'd spawn it as a separate process).

Update: I think what's happening here is that the above ALTER command is being called in a loop, and when I schedule a job this loop is being distributed and run in parallel. What I may actually need is a way to call it, or a function containing it, without letting the loop be distributed. Is there a way to force sequential execution for a single function?

CodePudding user response:

In the end I found a solution for this issue.

First, the problem seems to have been that the loop with the ALTER command was getting parallelized by Spark, and was thus firing multiple (conflicting) commands simultaneously on the same table.

The answer to this was two-fold:

  1. Add a .coalesce(1) to the end of the function I was calling with the ALTER line. This limits the function to sequential execution.
  2. Return a newly-created empty dataframe from the function to avoid coalesce-based errors.

Part 2 seems to have been necessary because this command is, I think, meant to return a result for aggregation. I couldn't find a way to make it work without that (.repartition(1) had the same issue), so in the end I returned spark.createDataFrame([(1, "foo")], ["id", "label"]) from the function and things then worked.
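Put together, the pattern I ended up with looks roughly like this (the table name and comment dictionary are placeholders, and this is just a sketch of the arrangement, not the exact code):

    # Sketch of the workaround: issue the ALTERs, return a dummy dataframe,
    # then coalesce the result to a single partition
    def update_comments(table, column_comments):
        for col, comment in column_comments.items():
            spark.sql(f'ALTER TABLE {table} CHANGE {col} COMMENT "{comment}"')
        # Dummy dataframe so there is something for .coalesce(1) to act on
        return spark.createDataFrame([(1, "foo")], ["id", "label"])

    result = update_comments("my_db_name.my_table_name", column_comments).coalesce(1)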

This gets me to my desired end goal of working through all the ALTER commands without conflict errors.

It's clunky as hell though; I'd still love to hear improvements or alternative approaches if anyone has one.

CodePudding user response:

If you want to change multiple columns at once, why not recreate the table? (This trick only works if the table is an external table. Here, table 'B' is the 'B'ad table with outdated comments, and table 'A' is the good table with the correct comments.)

  1. Drop the table ('B').
  2. Recreate the table with the required comments ('A'), as sketched below.
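A rough sketch of that drop-and-recreate step, assuming a non-Delta external table (e.g. Parquet) whose definition lives only in the metastore; the names, schema, and storage path here are illustrative:

    # Drop the catalog entry; for an external table the underlying files stay in place
    spark.sql("DROP TABLE IF EXISTS my_db_name.my_table_name")

    # Recreate the definition over the same location, this time with the desired comments
    spark.sql("""
        CREATE TABLE my_db_name.my_table_name (
            customer_id BIGINT COMMENT 'Unique customer identifier',
            order_ts    TIMESTAMP COMMENT 'Timestamp the order was placed'
        )
        USING PARQUET
        LOCATION '/mnt/data/my_table_name'
    """)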

If this table is NOT external, then you might want to create a view and start using that. This would enable you to add updated comments without altering the original table's data.

Have you considered using table properties instead of comments?
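A whole set of properties can be set in a single statement per table, for example (the keys and values here are just illustrative):

    # One statement per table instead of one ALTER per column
    spark.sql("""
        ALTER TABLE my_db_name.my_table_name SET TBLPROPERTIES (
            'comment.customer_id' = 'Unique customer identifier',
            'comment.order_ts'    = 'Timestamp the order was placed'
        )
    """)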
