Home > Mobile >  Python notebook - Importing data file with delimiter with two characters causes an error
Python notebook - Importing data file with delimiter with two characters causes an error

Time:02-24

Question: We are encountering the following error when loading a data file - that has two characters delimiter - to an Azure SQL Db. What we may be doing wrong and how can the issue be resolved?

Using a Python notebook in Azure Databricks, we are trying to load a data file into Azure SQL Db. The delimiter in the data file has two characters ~*. On the following code we get the errors shown below:

pandas dataframe low memory not supported with the 'python' engine

Code:

import sqlalchemy as sq
import pandas as pd

data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~*', engine='python', low_memory=False, quotechar='"', header='infer' , encoding='cp1252')
.............
.............

Remarks: If we remove the low_memory option, we get the following error. Although with other data files that are even larger than this file but have delimiter with a single character, we don't get the following error.

ConnectException: Connection refused (Connection refused) Error while obtaining a new communication channel ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

CodePudding user response:

Probably your File ist too large and the dataframe does not fit in memory. Can you try to split the Processing Up? I.e. read 1000 Limes, make a dataframe from that, Push to SQL, Thema read next 1000 lines etc?

nrows and skiprows passed to read_csv can be used for this.

Maybe a workaround: preprocess the file with sed s/-*/;/g, then you can use the c engine with lower memory footprint.

CodePudding user response:

From the documentation of Pandas.read_csv():

In addition, separators longer than 1 character and different from '\s ' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

Since it's being interpreted as a regular expression, and * has special meaning in regexp, you need to escape it. Use sep=r'~\*'

  • Related