I have a DataFrame where every entry is a string, and a given entry may contain consecutive whitespace characters. For example:
import re
import pandas as pd
df = pd.DataFrame({'col1': ['a--b', 'c  d'], 'col2': ['e   f', 'g---h']})
print(df)
Output of print(df) (this is the initial df):
   col1   col2
0  a--b  e   f
1  c  d  g---h
I want to replace any run of consecutive whitespace with a single space in all the entries of df. So in this example, 'c  d' (with two consecutive spaces) should be replaced with 'c d', and 'e   f' (with three consecutive spaces) should be replaced with 'e f'.
Approach 1:
I get the correct result using df.replace, like so:
# Approach 1 - works fine
df = df.replace(r'\s+', ' ', regex=True)
print(df)
Output of print(df) (this is the expected result):
   col1   col2
0  a--b    e f
1   c d  g---h
Approach 2:
However, I get TypeError: expected string or bytes-like object when using df.transform, like so:
# Approach 2 - gives TypeError
df = df.transform(lambda s: re.sub(r'\s+', ' ', s))
print(df)
Output:
...
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
Approach 3:
I get ValueError: Transform function failed if I do:
# Approach 3 - gives ValueError
df = df.transform(lambda s: ' '.join(s.split()))
print(df)
Output:
...
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/apply.py", line 227, in transform
    raise ValueError("Transform function failed") from err
ValueError: Transform function failed
So where am I going wrong with Approaches 2 and 3?
I am asking because df.transform seems more powerful for transforming each cell in a DataFrame, and I will need it in my project for more complex transformations. Thank you!
CodePudding user response:
You need DataFrame.applymap for element-wise processing, because both functions work on scalars:
df = df.applymap(lambda s: re.sub(r'\s+', ' ', s))
print(df)
   col1   col2
0  a--b    e f
1   c d  g---h
df = df.applymap(lambda s: ' '.join(s.split()))
print(df)
   col1   col2
0  a--b    e f
1   c d  g---h
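A side note, assuming a newer pandas than the one in the question: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, which does the same element-wise work. A minimal sketch that uses whichever one is available (the helper name collapse_spaces is made up for illustration):

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': ['a--b', 'c  d'], 'col2': ['e   f', 'g---h']})

def collapse_spaces(s):
    # s is a single cell value (a plain str), so re.sub works here
    return re.sub(r'\s+', ' ', s)

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap
if hasattr(pd.DataFrame, 'map'):
    result = df.map(collapse_spaces)
else:
    result = df.applymap(collapse_spaces)

print(result)
```

Either branch calls the function once per cell, so the result should match the applymap output above.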
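DataFrame.transform passes each whole column to the function as a Series rather than one cell at a time, so both approaches failed: re.sub received a Series instead of a string, and a Series has no split method.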
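One way to check this yourself is to record the type of the object the function receives. This is only a small diagnostic sketch; record_type and seen_types are illustrative names, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a--b', 'c  d'], 'col2': ['e   f', 'g---h']})

seen_types = set()

def record_type(x):
    # Remember what kind of object transform hands to the function
    seen_types.add(type(x).__name__)
    # Return the input unchanged so the transform itself succeeds
    return x

df.transform(record_type)
print(seen_types)
```

With a function that does not raise, this should report only Series, which is why string functions such as re.sub cannot be used directly here.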
You can rewrite the second solution with Series.str.split and Series.str.join to process whole columns (Series):
def f(x):
    # x is a whole column (a Series), not a single cell
    # print(x)  # uncomment to inspect what transform passes in
    return x.str.split().str.join(' ')
df = df.transform(f)
print(df)
   col1   col2
0  a--b    e f
1   c d  g---h
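The regex version (Approach 2) can be made column-wise in the same way, using Series.str.replace with regex=True. A minimal sketch, with collapse_spaces_in_column as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a--b', 'c  d'], 'col2': ['e   f', 'g---h']})

def collapse_spaces_in_column(col):
    # col is a whole column (Series); .str applies the regex to each string in it
    return col.str.replace(r'\s+', ' ', regex=True)

result = df.transform(collapse_spaces_in_column)
print(result)
```

Note one difference between the two column-wise versions: split followed by join also strips leading and trailing whitespace, while the regex replacement leaves a single space there.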