SQL query results to pandas df within Databricks notebook


I have SQL query results that I would like to convert into a pandas df within a Databricks notebook. The query pulls data from Databricks tables, in case that is important to know.

My query looks something like this:

WITH cte1 AS (
    SELECT col1, col2 FROM Table1),
cte2 AS (
    SELECT col10, col12 FROM Table2)

SELECT * FROM cte1 JOIN cte2 ON col1 = col10

The query returns the results I am looking for, and now I want to convert them into a pandas df while still inside Databricks. The last thing I tried, after some googling, was from pyspark.sql import SQLContext, though I could not find anything specific to my use case, and it throws a SQL error.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.sql("""
WITH cte1 AS (
    SELECT col1, col2 FROM Table1),
cte2 AS (
    SELECT col10, col12 FROM Table2)

SELECT * FROM cte1 JOIN cte2 ON col1 = col10
""")

Is there a way to convert the SQL query results into a pandas df within a Databricks notebook?

CodePudding user response:

Have you tried using a Spark dataframe instead of a pandas df? The two share many common functions. With a Spark df you keep using the power of Spark within Databricks, whereas a pandas df runs only on the driver machine's cores, which might return a memory error, especially if you are working with tens of GB of data. What worked for me is to create a table as usual, using your query directly as the source of the table you create:

CREATE TABLE newTable AS
    SELECT * FROM
        ( SELECT col1, col2 FROM Table1 ) t1
    JOIN
        ( SELECT col10, col12 FROM Table2 ) t2
    ON col1 = col10

The query above will report no output, but that is only because it creates a table. Then run the following to get a Spark dataframe:

dataframe = sqlContext.sql('select * from newTable')
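Note that SQLContext is a legacy entry point. On current Databricks runtimes the notebook already provides a SparkSession as the built-in variable spark, so an equivalent call (a minimal sketch, same behavior assumed) would be:

# using the SparkSession that Databricks exposes as `spark`
dataframe = spark.sql('SELECT * FROM newTable')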

Then use Spark functions to perform your analysis. Reminder: if your Databricks notebook defaults to a language other than Python, make sure to run your command cells with the %python magic command. You can start with dataframe.printSchema(), which is like pandas' df.info(), dataframe.columns to list all columns, dataframe.show(5) to display the first 5 rows, and so on; a short sketch follows below.
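Putting it together, a minimal sketch of that inspection step (assuming the newTable created above, and adding toPandas() in case you still need an actual pandas df at the end, which the original question asked for):

%python
# build a Spark dataframe from the table created above
dataframe = spark.sql('SELECT * FROM newTable')

dataframe.printSchema()    # schema overview, roughly like pandas df.info()
print(dataframe.columns)   # list all column names
dataframe.show(5)          # display the first 5 rows

# if a pandas df is still required, Spark dataframes expose toPandas();
# it collects every row to the driver, so only use it on results that fit in memory
pandas_df = dataframe.toPandas()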
