I have a dataset with 100,000 rows and 300 columns
Here is the sample dataset:
EVENT_DTL
0 8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1 8. Background : Engineer / living with his mom marriage_married
How can I remove the white blank between ‘with’ and ‘marriage_virgin’ but leave only one white blank?
Desired outout would be:
EVENT_DTL
0 8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1 8. Background : Engineer / living with his mom marriage_married
CodePudding user response:
You can use pandas.Series.str
to replace "\s "
(1 or more whitespace) by a single whitespace.
Try this :
df["EVENT_DTL"]= df["EVENT_DTL"].str.replace("\s ", " ", regex=True)
Output :
print(df)
EVENT_DTL
0 8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1 8. Background : Engineer / living with his mom marriage_married
If you need to clean up the whole dataframe, use pandas.DataFrame.replace
:
df.astype(str).replace("\s ", " ", regex=True, inplace=True)
CodePudding user response:
You can call string methods for a DataFrame column with
df["EVENT_DTL"].str.strip()
but .strip()
doesn't work, because it only removed extra characters from the start and end of the string. To remove all duplicate whitespaces you can use regex:
import re
import pandas as pd
d = {"EVENT_DTL": [
"8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family",
"8. Background : Engineer / living with his mom marriage_married"
]}
df = pd.DataFrame(d)
pattern = re.compile(" ")
df["EVENT_DTL"] = df["EVENT_DTL"].apply(lambda x: pattern.sub(" ", x))
print(df["EVENT_DTL"][0])