Home > database >  get a value from azure databricks database column from all tables
get a value from azure databricks database column from all tables

Time:01-11

I am trying to get the min and max date from each table in a database in databricks.

I already started with the following:

results =[]

tables_list = spark.sql('show tables in table')

for row in tables_list.collect():
  tablename = row.tableName
  min_update = spark.sql(f"SELECT MIN(date) FROM table.{tablename}").collect()[0][0]
  max_update = spark.sql(f"SELECT MAX(date) FROM table.{tablename}").collect()[0][0]

results.append((tablename,min_update,max_update))

df_update = spark.createDataFrame(results, schema = ['tablename', 'min_update', 'max_update'])

but when I df_update.display() I only get the last table where I expect to have 60 tables each with the max and min dates for each row of table.

CodePudding user response:

  • Since you want to get the max and min date from all the tables, you need to include the append operation inside the for loop.

  • Look at the following where I have modified the same code and got the required results. The following is a dataframe after executing show tables:

results =[]
tables_list = spark.sql('show tables in default')
tables_list.show()

enter image description here

  • Now I have used the append operation on result array inside the loop and got expected result.
for row in tables_list.collect():
    tablename = row.tableName
    #default is the name of my database and dt is date column
    min_update = spark.sql(f"SELECT MIN(dt) FROM default.{tablename}").collect()[0][0]
    max_update = spark.sql(f"SELECT MAX(dt) FROM default.{tablename}").collect()[0][0]

    results.append((tablename,min_update,max_update))

df_update = spark.createDataFrame(results, schema = ['tablename', 'min_update', 'max_update'])
df_update.display()

enter image description here

  • Related