How to use strip to substring in dataframe?-CodePudding

I have a dataset with 100,000 rows and 300 columns

Here is the sample dataset:

    EVENT_DTL
0   8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family

1   8. Background : Engineer / living with his mom marriage_married

How can I remove the white blank between ‘with’ and ‘marriage_virgin’ but leave only one white blank?

Desired outout would be:

        EVENT_DTL
    0   8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
    
    1   8. Background : Engineer / living with his mom marriage_married

CodePudding user response：

You can use pandas.Series.str to replace "\s " (1 or more whitespace) by a single whitespace.

Try this :

df["EVENT_DTL"]= df["EVENT_DTL"].str.replace("\s ", " ", regex=True)

Output :

print(df)
                                                                                                   EVENT_DTL
0  8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1  8. Background : Engineer / living with his mom marriage_married

If you need to clean up the whole dataframe, use pandas.DataFrame.replace :

df.astype(str).replace("\s ", " ", regex=True, inplace=True)

CodePudding user response：

You can call string methods for a DataFrame column with

df["EVENT_DTL"].str.strip()

but .strip() doesn't work, because it only removed extra characters from the start and end of the string. To remove all duplicate whitespaces you can use regex:

import re
import pandas as pd

d = {"EVENT_DTL": [
    "8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family",
    "8. Background : Engineer / living with his mom marriage_married"
]}
df = pd.DataFrame(d)
pattern = re.compile("  ")
df["EVENT_DTL"] = df["EVENT_DTL"].apply(lambda x: pattern.sub(" ", x))
print(df["EVENT_DTL"][0])