Azure Databricks OOM error that causes the connection to the Python REPL to be closed-CodePudding

In the following sample code, in one cell of our Azure Databricks notebook, the code loads about 20 million records into a Python pandas dataframe from an Azure SQL db, does some dataframe column tranformation by applying some functions (as shown in the code snippet below). But after running the code for about half an hour, the Databricks throws the following error:

Error:

ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

Remarks: Table has about 150 columns. The Spark setting on the Databricks is as follows: Cluster: 128 GB , 16 Cores, DBR 8.3, Spark 8.3, Scala 2.12

Question: What could be a cause of the error, and how can we fix it?

import sqlalchemy as sq
import pandas as pd

def fn_myFunction(lastname):
    testvar = lastname.lower()
    testvar = testvar.strip()
    
    return testvar

pw = dbutils.secrets.get(scope='SomeScope',key='sql')
engine = sq.create_engine('mssql pymssql://SERVICE.Databricks.NONPUBLICETL:' pw 'MyAzureSQL.database.windows.net:1433/TEST', isolation_level="AUTOCOMMIT")

app_df = pd.read_sql('select * from MyTable', con=engine)

#create new column
app_df['NewColumn'] = app_df['TestColumn'].apply(lambda x: fn_myFunction(x))
.............
.............

CodePudding user response：

This means that the driver crashed because of an OOM (Out of memory) exception and after that, it's not able to establish a new connection with the driver.

Please try below options

Try increasing driver-side memory and then retry.
You can look at the spark job dag which give you more info on data flow.

For more information follow this article by Aditi Sinha

CodePudding user response：

First we can understand most common reason for OOM issue with driver node.

One common cause for this error is that the driver is undergoing a memory bottleneck. When this happens, the driver crashes with an out of memory (OOM) condition and gets restarted or becomes unresponsive due to frequent full garbage collection. The reason for the memory bottleneck can be any of the following:

The driver instance type is not optimal for the load executed on the driver.
There are memory-intensive operations executed on the driver.
There are many notebooks or jobs running in parallel on the same cluster.

why this happened in your case ?

When we are processing any datasets through Pandas ,that run operations on a single machine(Driver node) not like PySpark runs on multiple machines. If you are working on dealing with larger datasets, through python pandas,No matter how big the cluster is. So you can increase memory size of driver node or processing in distributed mode .