I'm wanting to change a dataframe column so the values are lower case and also have their whitespace stripped.
For this I used chained str
transformations.
df.loc[:, column] = df.loc[:, column].str.lower().str.strip()
The above snippet works, but it looks quite messy as I have to use .str.
twice - is there a better/more efficient solution?
CodePudding user response:
You can use a list comprehension:
df['col2'] = [x.lower().strip() for x in df['col']]
Doing this can be faster than chaining multiple str
:
%%timeit
df['col2'] = df['col'].str.strip().str.lower()
# 344 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['col2'] = [x.lower().strip() for x in s]
# 182 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
used input (1M rows):
df = pd.DataFrame({'col': [' aBc DeF ']*1000000})
NB. I used strip
before lower
in the comparison as this is faster than lower
, then strip
.
non string values in the input
If there are non string values, the list comprehension will fail.
There are several possibilities to handle this.
Using a systematic check:
df['col2'] = [x if pd.isna(x) else x.lower().strip() for x in df['col']]
this will however add an extra burden and will probably be slower compared to only two chained str
(but might still be faster than more than 2 chained str
).
Using a custom function and try/except:
try/except has a very low cost if no error is triggered, which might make it a viable solution when the number of non-string values is low. In this case one would use:
def f(x):
try:
return x.lower().strip()
except: # add exhaustive expected exceptions if known
return x
df['col2'] = [f(x) for x in df['col']]
timings (1M rows with 1% non-string values):
df = pd.DataFrame({'col': np.random.choice([' aBc DeF ', float('nan')],
p=[0.99, 0.01], size=1000000)})
%%timeit
df['col2'] = df['col'].str.strip().str.lower()
349 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['col2'] = [x if pd.isna(x) else x.lower().strip() for x in df['col']]
442 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['col2'] = [f(x) for x in df['col']]
235 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
df = pd.DataFrame(['ABCD EFG', 'ABC DE'], columns=['col1'])
df['col1'] = df.apply(lambda row: row['col1'].lower().strip(), axis=1)