Question: Using a Python notebook in Azure Databricks, we are trying to load a data file into Azure SQL Db. The data file uses a two-character delimiter, ~*. With the code below we get the following error. What might we be doing wrong, and how can the issue be resolved?
Error:
ValueError: The 'low_memory' option is not supported with the 'python' engine
Code:
import sqlalchemy as sq
import pandas as pd
data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~*', engine='python', low_memory=False, quotechar='"', header='infer', encoding='cp1252')
.............
.............
Remarks: If we remove the low_memory option, we instead get the error below. With other data files that are even larger than this one but use a single-character delimiter, we don't get this error:
ConnectException: Connection refused (Connection refused) Error while obtaining a new communication channel ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
CodePudding user response:
Probably your file is too large and the dataframe does not fit in memory. Can you try to split the processing up? I.e. read 1000 lines, make a dataframe from that, push it to SQL, then read the next 1000 lines, etc.?
The nrows and skiprows parameters of read_csv can be used for this.
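A sketch of that chunked approach. Instead of managing nrows/skiprows offsets by hand, pandas' chunksize parameter yields the file in fixed-size pieces; the to_sql call and table name shown in the comments are placeholders, since the actual Azure SQL connection is not in the question:

```python
import io
import pandas as pd

def load_in_chunks(path_or_buf, push, chunk_size=1000):
    """Read the '~*'-delimited file in fixed-size chunks and hand each
    DataFrame to `push`, so only chunk_size rows are in memory at once."""
    reader = pd.read_csv(
        path_or_buf,
        sep=r"~\*",        # '*' is a regex metacharacter, so it is escaped
        engine="python",   # multi-character separators need the python engine
        quotechar='"',
        chunksize=chunk_size,
        # for the original file, also pass encoding="cp1252"
    )
    total = 0
    for chunk in reader:
        # In Databricks, push could be something like:
        #   lambda df: df.to_sql("MyTable", sql_engine, if_exists="append", index=False)
        push(chunk)
        total += len(chunk)
    return total

# Small in-memory demonstration in place of /dbfs/FileStore/tables/MyDataFile.txt:
buf = io.StringIO("a~*b\n1~*2\n3~*4\n5~*6\n")
chunks = []
rows = load_in_chunks(buf, chunks.append, chunk_size=2)
```

With if_exists="append", each chunk is added to the same target table, so the end result matches a single bulk load while keeping memory bounded.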
Maybe a workaround: preprocess the file with sed 's/~\*/;/g'
, then you can use the C engine, which has a lower memory footprint.
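If running sed is not convenient from the notebook, the same preprocessing can be sketched in Python. This assumes the replacement character ';' never appears in the data itself:

```python
import io
import pandas as pd

def replace_delimiter(text, old="~*", new=";"):
    # Plain string replacement: no regex involved, so no escaping pitfalls.
    # Only safe if `new` never occurs inside the data values themselves.
    return text.replace(old, new)

raw = "name~*value\nfoo~*1\nbar~*2\n"
clean = replace_delimiter(raw)

# A single-character separator lets pandas use the default C engine,
# which is faster and supports the low_memory option.
df = pd.read_csv(io.StringIO(clean), sep=";")
```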
CodePudding user response:
From the documentation of Pandas.read_csv()
:
In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.
Since the separator is being interpreted as a regular expression, and *
has special meaning in regular expressions, you need to escape it. Use sep=r'~\*'
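A minimal in-memory demonstration of the escaped separator:

```python
import io
import pandas as pd

data = "a~*b~*c\n1~*2~*3\n4~*5~*6\n"

# Unescaped, '~*' is the regex "zero or more '~'", which does not match
# the literal two-character delimiter. Escaping the '*' makes it literal.
df = pd.read_csv(io.StringIO(data), sep=r"~\*", engine="python")
```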