I would like to clean a DataFrame so that cells containing only numbers have their spaces removed, while cells containing names stay unchanged.
Author
07 07 34
08 26 20
08 26 20
Tata Smith
Jhon Doe
08 26 22
3409243
Here is my approach, which is failing:
df.loc[df["Author"].str.isdigit(), "Author"] = df["Author"].strip()
How can I handle this?
CodePudding user response:
You might want to use a regular expression to select the cells that contain only digits and spaces.
import io
import re  # only needed for the apply() alternative at the end
import pandas as pd
# Create a sample dataframe
df = pd.read_csv(io.StringIO('Author\n 07 07 34 \n 08 26 20 \n 08 26 20 \n Tata Smith\n Jhon Doe\n 08 26 22\n 3409243'))
# Select the cells made up only of digits and spaces
mask = df['Author'].str.fullmatch(r'[\d ]*')
# Remove the spaces in those cells only
df.loc[mask, 'Author'] = df.loc[mask, 'Author'].str.replace(' ', '', regex=False)
# You can also do the same treatment with the following line
# df['Author'] = df['Author'].apply(lambda s: s.replace(' ', '') if re.fullmatch(r'[\d ]*', s) else s)
Output:
Author |
---|
070734 |
082620 |
082620 |
Tata Smith |
Jhon Doe |
082622 |
3409243 |
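For context, the attempt in the question fails for two reasons: a Series has no `.strip()` method (string methods live under `.str`), and `str.isdigit()` returns False for any value that contains a space, so the mask never selects the cells that need cleaning. Below is a minimal sketch of the regex approach wrapped in a reusable function. The helper name `strip_spaces_in_numeric_cells` is hypothetical, and the pattern is assumed to require at least one digit, which is slightly stricter than the pattern above.

```python
import pandas as pd

def strip_spaces_in_numeric_cells(series: pd.Series) -> pd.Series:
    """Remove all spaces from cells made up only of digits and spaces."""
    # Hypothetical helper for illustration; not part of pandas.
    result = series.copy()
    # Match cells with at least one digit and nothing but digits and spaces.
    # na=False treats missing values as "no match".
    mask = result.str.fullmatch(r" *\d[\d ]*", na=False)
    # Literal replacement of spaces, applied to the matching cells only.
    result.loc[mask] = result.loc[mask].str.replace(" ", "", regex=False)
    return result

df = pd.DataFrame(
    {"Author": [" 07 07 34 ", "08 26 20", "Tata Smith", "Jhon Doe", "3409243"]}
)
df["Author"] = strip_spaces_in_numeric_cells(df["Author"])
print(df["Author"].tolist())
```

The cleaned values stay strings, so leading zeros such as the one in `070734` are kept.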
CodePudding user response:
How about this?
import pandas as pd
df = pd.read_csv('two.csv')
# remove spaces on copy
df['Author_clean'] = df['Author'].str.replace(" ","")
# try conversion to numeric if possible
df['Author_clean'] = pd.to_numeric(df['Author_clean'], errors='coerce')
# fill missing vals with original strings
df['Author_clean'] = df['Author_clean'].fillna(df['Author'])
print(df.head(10))
Output:
       Author Author_clean
0    07 07 34      70734.0
1    08 26 20      82620.0
2    08 26 20      82620.0
3  Tata Smith   Tata Smith
4    Jhon Doe     Jhon Doe
5    08 26 22      82622.0
6     3409243    3409243.0
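Note that the numeric conversion turns `07 07 34` into `70734.0`, so the leading zero is lost and the values become floats. If the cleaned cells should stay strings, one possible variation (a sketch, not the code from this answer) is to use `pd.to_numeric` only to decide which cells are numeric. The sample data is built inline here instead of being read from `two.csv`.

```python
import pandas as pd

df = pd.DataFrame(
    {"Author": ["07 07 34", "08 26 20", "Tata Smith", "Jhon Doe", "3409243"]}
)

# Remove spaces on a copy of the column.
no_spaces = df["Author"].str.replace(" ", "", regex=False)
# Cells that parse as numbers once the spaces are gone are the ones to clean.
is_numeric = pd.to_numeric(no_spaces, errors="coerce").notna()
# Keep the original text for names, use the space-free string for numbers.
df["Author_clean"] = df["Author"].where(~is_numeric, no_spaces)
print(df["Author_clean"].tolist())
```

Be aware that `pd.to_numeric` also accepts strings such as `1e5`, so the regex mask from the other answer may be the safer choice if such values can appear in the column.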