Removing unnecessary spaces in text

Could you please shed some light on this?
The spaces are not handled properly (tests C and E) and I don't understand what is wrong.
Thanks a lot.

import pandas as pd

foo = {'testing': ['this    is test A', '  this is test B', ' this is test C ', '   this is test D', '   this is test E  ']}
foo = pd.DataFrame(foo, columns=['testing'])
print("Before:")
print(foo, "\n")
# Collapse every run of whitespace into a single space
foo.replace(r'\s+', ' ', regex=True, inplace=True)
print("After:")
print(foo)

Before:
               testing
0    this    is test A
1       this is test B
2      this is test C 
3       this is test D
4     this is test E   

After:
            testing
0    this is test A
1    this is test B
2   this is test C 
3    this is test D
4   this is test E

CodePudding user response：

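It's probably easier to clean the strings in the dictionary before constructing the DataFrame. You also need to strip leading and trailing whitespace from each string, because the regex only collapses a run of whitespace into a single space; it never removes it entirely.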

import pandas as pd
import re

foo={'testing':['this    is test A','  this is test B',' this is test C ','   this is test D','   this is test E  ']}

foo['testing'] = [re.sub(r'\s+', ' ', s.strip()) for s in foo['testing']]

foo = pd.DataFrame(foo, columns=['testing'])

print(foo)

Output:

          testing
0  this is test A
1  this is test B
2  this is test C
3  this is test D
4  this is test E
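
To illustrate why the original call leaves tests C and E looking wrong, here is a minimal sketch in plain Python (no pandas). The variable names `samples`, `collapsed_only`, and `stripped_then_collapsed` are made up for this illustration. It compares collapsing whitespace alone with stripping first and then collapsing.

```python
import re

samples = ['this    is test A', ' this is test C ', '   this is test E  ']

# Collapsing alone: every run of whitespace becomes one space,
# so a leading or trailing run is shortened to a single space, not removed.
collapsed_only = [re.sub(r'\s+', ' ', s) for s in samples]

# Stripping first removes the leading and trailing runs,
# leaving only the interior runs for the regex to collapse.
stripped_then_collapsed = [re.sub(r'\s+', ' ', s.strip()) for s in samples]

print(collapsed_only)
print(stripped_then_collapsed)
```

The first list still carries one space at the edges of the C and E strings, which matches the "After" output in the question.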

CodePudding user response：

# strip leading and trailing whitespace first, then collapse runs of whitespace inside the strings
foo['testing'] = foo['testing'].str.strip().str.replace(r'\s+', ' ', regex=True)
print(foo)
          testing
0  this is test A
1  this is test B
2  this is test C
3  this is test D
4  this is test E

CodePudding user response：

You can do it without regex:

foo["testing"] = foo["testing"].str.split().str.join(" ")
print(foo)

Prints:

          testing
0  this is test A
1  this is test B
2  this is test C
3  this is test D
4  this is test E
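
The split/join approach works because splitting with no separator treats any run of whitespace as one delimiter and discards empty pieces at both ends. A rough plain-Python sketch of the same idea follows; `normalize_spaces` is a hypothetical helper name, not part of pandas or the standard library.

```python
def normalize_spaces(text):
    """Collapse whitespace runs to one space and drop leading/trailing whitespace."""
    # str.split() with no arguments splits on any whitespace run
    # and ignores leading and trailing whitespace.
    return " ".join(text.split())


samples = ['this    is test A', '  this is test B', ' this is test C ',
           '   this is test D', '   this is test E  ']
cleaned = [normalize_spaces(s) for s in samples]
print(cleaned)
```

Note that this treats tabs and newlines as whitespace too, the same as `\s+` does in the regex versions.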