How can I save a pyspark dataframe in databricks to SAP Hana (SAP Data Warehouse Cloud) using hdbcli-CodePudding

I need to push some data from Databricks on AWS to SAP Data Warehouse cloud, and have been encouraged to use the python hdbcli (https://pypi.org/project/hdbcli/). The only documentation I have been able to find is the one in pypi, which is quite scarce. I can see an example of how to push individual rows to a sql table, but I have found no examples of how to save a pyspark dataframe to a table in SAP Data Warehouse cloud.

Documentation examples:

sql = 'INSERT INTO T1 (ID, C2) VALUES (:id, :c2)'
cursor = conn.cursor()
id = 3
c2 = "goodbye"
cursor.execute(sql, {"id": id, "c2": c2})
# returns True
cursor.close()

I have tried the following in my data bricks notebook:

df.createOrReplaceTempView("final_result_local")
sql = "INSERT INTO final_result SELECT * FROM final_result_local"
cursor.execute(sql)
cursor.close()

After this I got the following error:

invalid table name: Could not find table/view FINAL_RESULT_LOCAL in schema DATABRICKS_SCHEMA

It seems df.createOrReplaceTempView created the sql table in a different context to the one called by hdbcli, and I don't know how to push the local table to sap data warehouse cloud. Any help would be much appreciated.

CodePudding user response：

I disagree with using hdbcli. Instead look into connecting from Spark directly, this instruction should be helpful.

CodePudding user response：

You should consider using the Python machine learning client for SAP HANA (hana-ml). You can think of it as being an abstraction layer on top of hdbcli. The central object to send and retrieve data is the HANA dataframe, which behaves similar to a Pandas dataframe, but is persisted on database side (i.e. this can be a table).

For your scenario, you should be able to create a HANA dataframe and thus a table using function create_dataframe_from_spark() (see documentation).

Regarding the direct use of hdbcli, you can find the full documentation here (also linked on PyPi).