Remove duplicates from row applying exceptions

I am trying to remove duplicate words within each row, but strings with length <= 2 and integers must be kept even when they repeat.

I have a sentence like this:

AIR OPTIX Air Optix plus HydraGlyde Lenti a Contatto Mensili, 6 Lenti, BC a 6 mm, DIA 14.2 mm, -0.75 Diopt

What I need to obtain is:

AIR OPTIX plus HydraGlyde Lenti a Contatto Mensili, 6 Lenti, BC a 6 mm, DIA 14.2 mm, -0.75 Diopt

What I get with the function below is:

AIR OPTIX plus HydraGlyde Lenti Contatto Mensili, 6 Lenti, BC mm, DIA 14.2 -0.75 Diopt

The repeated a, 6 and mm are missing.

I need to modify the function so that duplicate removal ignores strings with len <= 2 and any kind of integer. These values should remain exactly as they are.

def uniqueList(row):
    words = str(row).split(" ")
    unique = words[0]
    for w in words:
        if w.lower() not in unique.lower():
            unique = unique + " " + w

    return unique

df["Correction_Value"] = df["Correction_Value"].apply(uniqueList)

CodePudding user response：

Here. Not the best code, but it gets the job done and passes the test.

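def uniqueList(row):
    words = str(row).split(" ")
    unique = words[0]
    # Start from the second word; the first one is already in `unique`.
    for w in words[1:]:
        # Integers are always kept, even when repeated.
        try:
            int(w)
            unique = unique + " " + w
            continue
        except ValueError:
            pass

        # Tokens containing "mm" (such as "mm,") are always kept as well.
        if 'mm' in w:
            unique = unique + " " + w
            continue

        # Keep short words unconditionally; keep longer ones only if new.
        if (w.lower() not in unique.lower()) or (len(w) <= 2):
            unique = unique + " " + w

    return unique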

df["Correction_Value"] = df["Correction_Value"].apply(uniqueList)
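
One caveat: `w.lower() not in unique.lower()` is a substring test on the joined string, not a word-level comparison. That is why the single letter a disappears in the question's output (it occurs inside "air"), and it also means a new word that happens to be part of an earlier one would be dropped. The helper names below are made up for illustration; this is only a small sketch of the difference, not part of the function above.

```python
def seen_as_substring(word: str, kept: str) -> bool:
    """Substring test, as in `w.lower() in unique.lower()`."""
    return word.lower() in kept.lower()


def seen_as_word(word: str, kept: str) -> bool:
    """Word-level test: compare against whole tokens only."""
    return word.lower() in kept.lower().split()


kept = "AIR OPTIX plus"
print(seen_as_substring("a", kept))    # True: "a" occurs inside "air"
print(seen_as_word("a", kept))         # False: no token equals "a"
print(seen_as_substring("Opt", kept))  # True: "opt" occurs inside "optix"
print(seen_as_word("Opt", kept))       # False: no token equals "opt"
```

If word-level matching is what you want, comparing against a list or set of tokens avoids these false positives.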

CodePudding user response：

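The requirement is to remove duplicate words in each row while keeping strings with length <= 2 and integers.

The code below does that. Note that mm and mm, are different tokens: mm, has three characters, so the length <= 2 exception does not apply to it and its second occurrence is removed.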

def rem_dups(value: str) -> str:
    def _is_int(x):
        try:
            int(x)
            return True
        except ValueError:
            return False
    result = []
    seen = set()  # lowercased words already emitted
    words = value.split()
    for word in words:
        # Short tokens and integers are exempt from de-duplication.
        if len(word) <= 2 or _is_int(word):
            result.append(word)
        else:
            key = word.lower()  # compare case-insensitively
            if key not in seen:
                result.append(word)
                seen.add(key)
    return ' '.join(result)


_value = 'AIR OPTIX Air Optix plus HydraGlyde Lenti a Contatto Mensili, 6 Lenti, BC a 6 mm, DIA 14.2 mm, -0.75 Diopt'
print(rem_dups(_value))

Output:

AIR OPTIX plus HydraGlyde Lenti a Contatto Mensili, 6 Lenti, BC a 6 mm, DIA 14.2 -0.75 Diopt
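
This output differs from the one requested in the question only in the second mm, which is dropped because of its trailing comma. If that token should survive, one possible variant is to apply the length and integer exceptions to the word with surrounding punctuation stripped, while still de-duplicating on the raw token. The name `rem_dups_keep_short` and the choice of `string.punctuation` are assumptions made for this sketch, not something the question specifies.

```python
import string


def _is_int(token: str) -> bool:
    """Return True if the token parses as an integer."""
    try:
        int(token)
        return True
    except ValueError:
        return False


def rem_dups_keep_short(value: str) -> str:
    result = []
    seen = set()  # lowercased raw tokens already emitted
    for word in value.split():
        # Assumption: exemptions are judged on the word without
        # leading/trailing punctuation, so "mm," counts as "mm".
        core = word.strip(string.punctuation)
        if len(core) <= 2 or _is_int(core):
            result.append(word)
            continue
        key = word.lower()  # duplicates are compared on the raw token
        if key not in seen:
            result.append(word)
            seen.add(key)
    return ' '.join(result)


_value = ('AIR OPTIX Air Optix plus HydraGlyde Lenti a Contatto Mensili, '
          '6 Lenti, BC a 6 mm, DIA 14.2 mm, -0.75 Diopt')
print(rem_dups_keep_short(_value))
# AIR OPTIX plus HydraGlyde Lenti a Contatto Mensili, 6 Lenti, BC a 6 mm, DIA 14.2 mm, -0.75 Diopt
```

Because duplicates are still compared on the raw token, Lenti and Lenti, remain distinct, which matches the expected result in the question.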