Question: Using a Python notebook in Azure Databricks, we are trying to load a data file into Azure SQL Db. The data file uses a two-character delimiter, ~*. With the code below we get the following error. What might we be doing wrong, and how can the issue be resolved?
Error:
ValueError: The 'low_memory' option is not supported with the 'python' engine
Code:
import sqlalchemy as sq
import pandas as pd
data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~*', engine='python', low_memory=False, quotechar='"', header='infer', encoding='cp1252')
.............
.............
Remarks: If we remove the low_memory option, we instead get the error below. With other data files that are even larger than this one but use a single-character delimiter, we don't get this error:
ConnectException: Connection refused (Connection refused) Error while obtaining a new communication channel ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
CodePudding user response:
Probably your file is too large and the dataframe does not fit in memory. Can you try to split the processing up? I.e. read 1000 lines, make a dataframe from that, push it to SQL, then read the next 1000 lines, etc.?
The nrows and skiprows parameters of read_csv can be used for this.
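A sketch of that chunked approach. Instead of managing nrows/skiprows offsets by hand, pandas' chunksize parameter yields the file in fixed-size pieces; the to_sql call and table name shown in the comments are placeholders, since the actual Azure SQL connection is not in the question:

```python
import io
import pandas as pd

def load_in_chunks(path_or_buf, push, chunk_size=1000):
    """Read the '~*'-delimited file in fixed-size chunks and hand each
    DataFrame to `push`, so only chunk_size rows are in memory at once."""
    reader = pd.read_csv(
        path_or_buf,
        sep=r"~\*",        # '*' is a regex metacharacter, so it is escaped
        engine="python",   # multi-character separators need the python engine
        quotechar='"',
        chunksize=chunk_size,
        # for the original file, also pass encoding="cp1252"
    )
    total = 0
    for chunk in reader:
        # In Databricks, push could be something like:
        #   lambda df: df.to_sql("MyTable", sql_engine, if_exists="append", index=False)
        push(chunk)
        total += len(chunk)
    return total

# Small in-memory demonstration in place of /dbfs/FileStore/tables/MyDataFile.txt:
buf = io.StringIO("a~*b\n1~*2\n3~*4\n5~*6\n")
chunks = []
rows = load_in_chunks(buf, chunks.append, chunk_size=2)
```

With if_exists="append", each chunk is added to the same target table, so the end result matches a single bulk load while keeping memory bounded.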
Maybe a workaround: preprocess the file with sed 's/~\*/;/g'
, then you can use the C engine, which has a lower memory footprint.
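If running sed is not convenient from the notebook, the same preprocessing can be sketched in Python. This assumes the replacement character ';' never appears in the data itself:

```python
import io
import pandas as pd

def replace_delimiter(text, old="~*", new=";"):
    # Plain string replacement: no regex involved, so no escaping pitfalls.
    # Only safe if `new` never occurs inside the data values themselves.
    return text.replace(old, new)

raw = "name~*value\nfoo~*1\nbar~*2\n"
clean = replace_delimiter(raw)

# A single-character separator lets pandas use the default C engine,
# which is faster and supports the low_memory option.
df = pd.read_csv(io.StringIO(clean), sep=";")
```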
CodePudding user response:
From the documentation of Pandas.read_csv()
:
In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.
Since the separator is being interpreted as a regular expression, and *
has special meaning in regular expressions, you need to escape it. Use sep=r'~\*'
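A minimal in-memory demonstration of the escaped separator:

```python
import io
import pandas as pd

data = "a~*b~*c\n1~*2~*3\n4~*5~*6\n"

# Unescaped, '~*' is the regex "zero or more '~'", which does not match
# the literal two-character delimiter. Escaping the '*' makes it literal.
df = pd.read_csv(io.StringIO(data), sep=r"~\*", engine="python")
```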