Pandas: faster string operations in dataframes

I am working on a Python script that reads data from a database and saves it to a .csv file. To save it correctly I need to escape certain characters, such as \r\n or \n. Here is how I am currently doing it:

First, I use the pandas read_sql function to read the data from the database.

import pandas as pd

df = pd.read_sql(
    sql = 'SELECT * FROM exampleTable',
    con = SQLAlchemyConnection
)

The table I get contains values of mixed types.

Then, the script updates the resulting dataframe, converting every string value to its raw-string representation. To achieve that, I use two nested for loops to operate on every single value.

def update_df(df):
    # Walk over every cell and replace strings with their repr()
    for rowIndex, row in df.iterrows():
        for colIndex, value in row.items():
            if isinstance(value, str):
                df.at[rowIndex, colIndex] = repr(value)
    return df

However, the amount of data I need to process is large (more than 1 million rows and more than 100 columns), and this takes hours.

What I need is a faster way to create the csv file.

Thank you in advance.

CodePudding user response:

It should be faster to use applymap if you really have mixed types:

df = df.applymap(lambda x: repr(x) if isinstance(x, str) else x)
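
As a sketch of that idea (the helper name escape_strings is hypothetical, and the frame is assumed to come straight from read_sql), you can restrict the element-wise pass to object-dtype columns, since only those can hold Python strings:

import pandas as pd

def escape_strings(df: pd.DataFrame) -> pd.DataFrame:
    # Only object-dtype columns can contain Python strings,
    # so skip numeric and datetime columns entirely.
    obj_cols = df.select_dtypes(include='object').columns
    df[obj_cols] = df[obj_cols].applymap(
        lambda x: repr(x) if isinstance(x, str) else x
    )
    return df

On recent pandas releases, DataFrame.applymap is deprecated in favour of DataFrame.map, which takes the same element-wise function.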

However, if you can identify the string columns, then you can slice them (maybe in combination with re.escape?):

import re
str_cols = ['col1', 'col2']
df[str_cols] = df[str_cols].applymap(re.escape)
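
If the characters to escape are really just the newline variants from the question, a different technique worth sketching is the vectorized Series.str.replace accessor, which avoids a Python-level call per element. The frame below is a hypothetical stand-in for the one read from the database, and it assumes col1 and col2 hold only strings:

import pandas as pd

# Hypothetical demo frame; in practice df comes from read_sql.
df = pd.DataFrame({'col1': ['a\r\nb', 'plain'],
                   'col2': ['x\ny', 'z']})

str_cols = ['col1', 'col2']
for col in str_cols:
    # Escape the two-character Windows newline first so the
    # standalone '\n' pass does not split it in two.
    df[col] = df[col].str.replace('\r\n', r'\r\n', regex=False)
    df[col] = df[col].str.replace('\n', r'\n', regex=False)
    df[col] = df[col].str.replace('\r', r'\r', regex=False)

df.to_csv('example.csv', index=False)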