How to remove all string values from numerical data except "." and "-" using pandas

I have a column of numerical data in which every value has stray characters such as $, # or unit text attached. My data looks like this:

SIZE = [ 10 OZ , 20 OZ, 2.5. OZ , #30.1 OZ, !2 O Z , 1-8 OZ, 1-7OZ, 20 OZ]

But when I delete all the non-numeric characters, the "." and "-" characters are removed as well, which I don't want. How can I remove the non-numeric characters from a numerical column while keeping the decimal point and "-" using pandas?

My desired output is:

SIZE = [ 10, 20, 2.5, 30.1, 2, 1-8, 1-7, 20]

This is just a simplified sample; my actual data has around 600 values.

CodePudding user response：

Try this:

import re

# Match any character that is not a digit, comma, dot or hyphen
full_pattern = re.compile(r"[^\d,.-]")

def re_replace(data_list):
    new_data = []
    for data in data_list:
        new_data.append(re.sub(full_pattern, '', data))
    return new_data

data = [ "10 OZ" , "20 OZ", "2.5. OZ" , "#30.1 OZ", "!2 O Z" , "1-8 OZ", "1-7OZ", "20 OZ"]
st = re_replace(data)

print(st)

Output:

['10', '20', '2.5.', '30.1', '2', '1-8', '1-7', '20']
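
Note that '2.5.' still carries a trailing dot, which the desired output does not have. A minimal sketch of one way to handle that, assuming only trailing dots are unwanted (the helper name clean_sizes is made up for illustration):

```python
import re

# Keep only digits, dots and hyphens.
pattern = re.compile(r"[^\d.-]")

def clean_sizes(values):
    # Drop every other character, then trim dots left dangling at the end.
    return [pattern.sub("", value).rstrip(".") for value in values]

data = ["10 OZ", "20 OZ", "2.5. OZ", "#30.1 OZ", "!2 O Z", "1-8 OZ", "1-7OZ", "20 OZ"]
cleaned = clean_sizes(data)
print(cleaned)
# ['10', '20', '2.5', '30.1', '2', '1-8', '1-7', '20']
```

Using rstrip rather than strip leaves a leading dot in place, so a value such as ".5 OZ" would still come out as ".5".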

CodePudding user response：

I haven't worked with pandas, but you can use this regex to get the required result.

import re
# Remove everything except digits, dots and hyphens
re.sub(r"[^0-9.-]", "", "sdkjh987978asd098.as0980-a98sd")  # '987978098.0980-98'
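
One detail worth knowing about character classes: only the ^ directly after the opening bracket negates the class. Any ^ placed later is treated as a literal caret, so carets in the input would survive the substitution. A quick sketch comparing the two spellings (the sample string is invented for the demonstration):

```python
import re

sample = "x^1.5-2 OZ"

# Extra carets inside the class are literal, so "^" counts as an allowed character.
with_extra_carets = re.sub(r"[^0-9^.^-]", "", sample)

# A single leading caret negates the class; only digits, "." and "-" are kept.
single_caret = re.sub(r"[^0-9.-]", "", sample)

print(with_extra_carets)  # ^1.5-2
print(single_caret)       # 1.5-2
```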

CodePudding user response：

Pandas' .str.replace() takes regular expressions, too.

import pandas as pd

SIZE = [ "10 OZ" , "20 OZ", "2.5. OZ" , "#30.1 OZ", "!2 O Z" , "1-8 OZ", "1-7OZ", "20 OZ"]

df = pd.DataFrame({"SIZE": SIZE})

# remove everything that's not a number, dot or hyphen and strip leading/trailing dots
df["SIZE"] = (
    df["SIZE"]
    .str.replace(r"[^0-9.-]", "", regex=True)
    .str.strip(".")
)

Result:
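
One way to check the cleaned column is to convert it to a plain list. This is only a sketch that re-runs the same chain, assuming the pattern is written without a trailing space; the variable name cleaned is introduced here for illustration:

```python
import pandas as pd

SIZE = ["10 OZ", "20 OZ", "2.5. OZ", "#30.1 OZ", "!2 O Z", "1-8 OZ", "1-7OZ", "20 OZ"]
df = pd.DataFrame({"SIZE": SIZE})

# Drop everything that is not a digit, dot or hyphen, then trim stray dots.
df["SIZE"] = (
    df["SIZE"]
    .str.replace(r"[^0-9.-]", "", regex=True)
    .str.strip(".")
)

cleaned = df["SIZE"].tolist()
print(cleaned)
# ['10', '20', '2.5', '30.1', '2', '1-8', '1-7', '20']
```

The values are still strings at this point. Entries such as '1-8' are ranges rather than single numbers, so converting the whole column to a numeric dtype would need a separate decision about how to treat them.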