get a value from azure databricks database column from all tables-CodePudding

I am trying to get the min and max date from each table in a database in databricks.

I already started with the following:

results =[]

tables_list = spark.sql('show tables in table')

for row in tables_list.collect():
  tablename = row.tableName
  min_update = spark.sql(f"SELECT MIN(date) FROM table.{tablename}").collect()[0][0]
  max_update = spark.sql(f"SELECT MAX(date) FROM table.{tablename}").collect()[0][0]

results.append((tablename,min_update,max_update))

df_update = spark.createDataFrame(results, schema = ['tablename', 'min_update', 'max_update'])

but when I df_update.display() I only get the last table where I expect to have 60 tables each with the max and min dates for each row of table.

CodePudding user response：

Since you want to get the max and min date from all the tables, you need to include the append operation inside the for loop.
Look at the following where I have modified the same code and got the required results. The following is a dataframe after executing show tables:

results =[]
tables_list = spark.sql('show tables in default')
tables_list.show()

enter image description here

Now I have used the append operation on result array inside the loop and got expected result.

for row in tables_list.collect():
    tablename = row.tableName
    #default is the name of my database and dt is date column
    min_update = spark.sql(f"SELECT MIN(dt) FROM default.{tablename}").collect()[0][0]
    max_update = spark.sql(f"SELECT MAX(dt) FROM default.{tablename}").collect()[0][0]

    results.append((tablename,min_update,max_update))

df_update = spark.createDataFrame(results, schema = ['tablename', 'min_update', 'max_update'])
df_update.display()

enter image description here