Home > Software engineering >  Efficiency of multiple chained str transformation and alternatives
Efficiency of multiple chained str transformation and alternatives

Time:05-06

I'm wanting to change a dataframe column so the values are lower case and also have their whitespace stripped.

For this I used chained str transformations.

df.loc[:, column] = df.loc[:, column].str.lower().str.strip()

The above snippet works, but it looks quite messy as I have to use .str. twice - is there a better/more efficient solution?

CodePudding user response:

You can use a list comprehension:

df['col2'] = [x.lower().strip() for x in df['col']]

Doing this can be faster than chaining multiple str:

%%timeit
df['col2'] = df['col'].str.strip().str.lower()
# 344 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['col2'] = [x.lower().strip() for x in s]
# 182 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

used input (1M rows):

df = pd.DataFrame({'col': ['  aBc  DeF    ']*1000000})

NB. I used strip before lower in the comparison as this is faster than lower, then strip.

non string values in the input

If there are non string values, the list comprehension will fail.

There are several possibilities to handle this.

Using a systematic check:
df['col2'] = [x if pd.isna(x) else x.lower().strip() for x in df['col']]

this will however add an extra burden and will probably be slower compared to only two chained str (but might still be faster than more than 2 chained str).

Using a custom function and try/except:

try/except has a very low cost if no error is triggered, which might make it a viable solution when the number of non-string values is low. In this case one would use:

def f(x):
    try:
        return x.lower().strip()
    except:  # add exhaustive expected exceptions if known
        return x

df['col2'] = [f(x) for x in df['col']]

timings (1M rows with 1% non-string values):

df = pd.DataFrame({'col': np.random.choice(['  aBc  DeF    ', float('nan')],
                                           p=[0.99, 0.01], size=1000000)})

%%timeit
df['col2'] = df['col'].str.strip().str.lower()
349 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['col2'] = [x if pd.isna(x) else x.lower().strip() for x in df['col']]
442 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['col2'] = [f(x) for x in df['col']]
235 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

df = pd.DataFrame(['ABCD    EFG', 'ABC  DE'], columns=['col1'])
df['col1'] = df.apply(lambda row: row['col1'].lower().strip(), axis=1)
  • Related