I am currently working with a CSV file that contains columns holding both NaN and list values. When I read the data in, the list values become strings, i.e. [1, 2, 3] becomes "[1, 2, 3]". I have tried reading the data in with converters using ast.literal_eval, but unfortunately that approach doesn't work because of the NaN values present in the data.
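For context, this is roughly what I tried (a sketch; 'data.csv' and 'column1' stand in for my actual file and column name). As far as I can tell it fails because the converter also receives the empty strings that represent the missing cells, and ast.literal_eval('') raises a SyntaxError:
import ast
import pandas as pd

# sketch of the converters attempt that fails on missing values
df = pd.read_csv('data.csv', converters={'column1': ast.literal_eval})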
I solved this issue by applying the following function to the columns that had this issue.
import numpy as np

def string_to_list(row_value):
    if row_value is np.nan:
        return np.nan
    else:
        return eval(row_value)
The problem now is that there are over 200k rows in the dataframe and multiple columns have this issue, so this approach takes a lot of time (~1.5 s per column).
Below is a reproducible example
import numpy as np
import pandas as pd

column1 = [np.nan, "[1, 2, 3]", "[3, 2, 1]"] * 100000
df = pd.DataFrame(
    {
        'column1': column1
    }
)
df['column1'].apply(lambda x: string_to_list(x))
After some research, I found many posts saying that vectorization is a faster way of applying a function to a column, but I'm not sure how to use it in my case. I've tried the following, but I don't know how to make the function run only on the indices where the condition is False.
np.where(
    pd.isna(x),
    x,
    string_to_list(x)  # confused about how to pass the argument in here for the rows where the condition is False
)
Any advice is appreciated, thanks!
CodePudding user response:
In your case, use ast.literal_eval with a NaN check:
import ast

df['column1'] = df['column1'].map(lambda x: ast.literal_eval(x) if x == x else x)
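The x == x check works because NaN is the only value that is not equal to itself, so missing entries pass through untouched while everything else goes through ast.literal_eval.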
CodePudding user response:
There's some discussion about this topic here, which suggests that the following (less readable) function is faster:
def string_to_list2(row_value):
    if row_value is np.nan:
        return np.nan
    else:
        return list(map(str.strip, row_value.strip('][').replace('"', '').split(',')))
That does seem to be the case when I recreated your df with 1000 rows (thanks to @HarryPlotter's suggestions in the comments, too):
%%timeit
df['column1'].apply(string_to_list)
# 1000 loops, best of 5: 1.75 ms per loop
%%timeit
df['column1'].apply(string_to_list2)
# 1000 loops, best of 5: 555 µs per loop
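One thing to note (my own observation, not from the linked discussion): unlike ast.literal_eval, string_to_list2 returns the elements as strings ('1', '2', '3' rather than 1, 2, 3). If you need numbers back, a small variation, assuming the lists only ever hold integers, would be:
def string_to_list3(row_value):
    if row_value is np.nan:
        return np.nan
    else:
        # strip the brackets, split on commas, and cast each piece to int
        return [int(v) for v in row_value.strip('][').split(',')]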