I have a column of numerical data in which stray characters such as $, #, etc. are attached to the values. My data looks like this:
SIZE = [ 10 OZ , 20 OZ, 2.5. OZ , #30.1 OZ, !2 O Z , 1-8 OZ, 1-7OZ, 20 OZ]
But when I delete all the string characters, the "." and "-" characters are removed too, which I don't want. How can I remove the string characters from a numerical column, except for the decimal point and "-", using pandas?
My desired output is:
SIZE = [ 10, 20, 2.5, 30.1, 2, 1-8, 1-7, 20]
This is sample data to keep things simple; my actual data has around 600 values.
CodePudding user response:
Try this:
import re

# match runs of characters that are not a digit, comma, dot or hyphen
full_pattern = re.compile(r"[^\d,.-]+")

def re_replace(data_list):
    new_data = []
    for data in data_list:
        # delete every matched run, keeping digits and , . -
        new_data.append(re.sub(full_pattern, '', data))
    return new_data

data = ["10 OZ", "20 OZ", "2.5. OZ", "#30.1 OZ", "!2 O Z", "1-8 OZ", "1-7OZ", "20 OZ"]
st = re_replace(data)
print(st)
Output:
['10', '20', '2.5.', '30.1', '2', '1-8', '1-7', '20']
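Note that the output above still contains '2.5.' with a trailing dot, while the desired result is 2.5. A minimal sketch of one way to handle that, assuming a stray "." or "-" at either end of a value is never meaningful (the helper name clean_size is hypothetical):

```python
import re

# Runs of characters that are not a digit, dot or hyphen.
NON_NUMERIC = re.compile(r"[^\d.-]+")

def clean_size(value):
    """Drop non-numeric characters, then trim stray dots/hyphens at both ends."""
    return NON_NUMERIC.sub("", value).strip(".-")

data = ["10 OZ", "20 OZ", "2.5. OZ", "#30.1 OZ", "!2 O Z", "1-8 OZ", "1-7OZ", "20 OZ"]
cleaned = [clean_size(item) for item in data]
print(cleaned)
# ['10', '20', '2.5', '30.1', '2', '1-8', '1-7', '20']
```

The values stay strings here, because entries such as '1-8' cannot be converted to a single number.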
CodePudding user response:
I haven't worked with pandas, but you can use this regex to get the required result.
import re
# keep only digits, dots and hyphens
re.sub(r"[^0-9.-]", "", "sdkjh987978asd098.as0980-a98sd")  # '987978098.0980-98'
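Instead of deleting unwanted characters, another option is to extract the number-like part directly. The sketch below is only an illustration under some assumptions: each value contains at most one number or range, the digits of a number are not split by spaces, and the names NUMBER and extract_size are hypothetical.

```python
import re

# Digits, optionally followed by one "." or "-" and more digits (e.g. 2.5 or 1-8).
NUMBER = re.compile(r"\d+(?:[.-]\d+)?")

def extract_size(value):
    """Return the first number-like token in value, or None if there is none."""
    match = NUMBER.search(value)
    return match.group(0) if match else None

data = ["10 OZ", "20 OZ", "2.5. OZ", "#30.1 OZ", "!2 O Z", "1-8 OZ", "1-7OZ", "20 OZ"]
extracted = [extract_size(item) for item in data]
print(extracted)
# ['10', '20', '2.5', '30.1', '2', '1-8', '1-7', '20']
```

Because the pattern requires digits on both sides of a "." or "-", a trailing dot as in "2.5." is dropped without a separate strip step. Values with no digits at all come back as None, which makes them easy to spot.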
CodePudding user response:
Pandas' .str.replace() takes regular expressions, too.
import pandas as pd

SIZE = ["10 OZ", "20 OZ", "2.5. OZ", "#30.1 OZ", "!2 O Z", "1-8 OZ", "1-7OZ", "20 OZ"]
df = pd.DataFrame({"SIZE": SIZE})

# remove everything that's not a number, dot or hyphen and strip leading/trailing dots
df["SIZE"] = (df.SIZE
              .str.replace("[^0-9.-]+", "", regex=True)
              .str.strip("."))
Result:
>>> df
   SIZE
0    10
1    20
2   2.5
3  30.1
4     2
5   1-8
6   1-7
7    20