pandas: How do I remove duplicate rows by column similarity?

Time:08-26

How can I consider two rows to be duplicates if the difference between their strings is less than one character?

I know how to get the count of differences between strings. I don't know how to use it in a dataframe.

count = sum(1 for a, b in zip(str1, str2) if a != b)
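For example, with two hashes that differ at a single position:

```python
# Count positionwise character differences between two equal-length strings.
str1 = "3131f3aff6f33303"
str2 = "3131f3a7f6f33303"
count = sum(1 for a, b in zip(str1, str2) if a != b)
print(count)  # 1 -- only the character at index 7 differs
```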

I am using the following to merge duplicate rows when the hashes match exactly. How can I adapt it to treat these near-duplicate strings the same way?

df = df.groupby("thumbnail_hash", sort=False, as_index=False).agg(
    {
        **columns,  # `columns` is a dict of other column aggregations defined elsewhere
        "pk": "first",
        "affiliate_url": lambda x: " ".join(x.dropna()),
        "site": lambda x: " ".join(x.dropna()),
    }
)

What is the best way to use this in a dataframe column, and groupby?

Input data:

pd.DataFrame(
    {
        "pk": {0: "1", 1: "2", 2: "3", 3: "1", 4: "3"},
        "thumbnail_hash": {
            0: "3131f3aff6f33303",
            1: "3131f3a7f6f33303",
            2: "edc939781b2e2e0f",
            3: "3131f3aff6f33303",
            4: "edc939781b2e2e0f",
        },
        "affiliate_url": {0: "url1", 1: "url2", 2: "url3", 3: "url1 url2", 4: "url3"},
        "site": {0: "site1", 1: "site2", 2: "site2", 3: "site1 site2", 4: "site2"},
    }
)

CodePudding user response:

Firstly I'd note that your string similarity function is quite idiosyncratic, and you'd likely be better off using something more robust such as Levenshtein distance. Secondly, I don't know a fast way to do what you're doing (see also this answer), but here is something which will work.
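As a sketch of a more robust similarity measure that needs no extra dependencies (a suggested alternative, not the method used below), the standard library's `difflib.SequenceMatcher` gives a ratio between 0 and 1:

```python
from difflib import SequenceMatcher

def similarity(s1: str, s2: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, s1, s2).ratio()

# The two near-duplicate hashes from the question differ in 1 of 16 characters:
print(similarity("3131f3aff6f33303", "3131f3a7f6f33303"))  # 0.9375
```

Unlike the positionwise count, this also handles insertions, deletions, and strings of unequal length.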

First check all of the possible pairs among your unique thumbnail hashes, and return those below a certain distance (you said "less than one character", which I interpreted as "one character or less", otherwise you could just use regular drop_duplicates):

from itertools import combinations

# Number of positions at which two equal-length strings differ.
check_similar = lambda s1, s2: sum(1 for a, b in zip(s1, s2) if a != b)
matches = [s for s in combinations(df_.thumbnail_hash.unique(), 2) if check_similar(*s) < 2]
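With the three unique hashes from the example data (hard-coded here for illustration), only the first pair falls within the threshold:

```python
from itertools import combinations

# The three unique thumbnail hashes from the question's input data.
hashes = ["3131f3aff6f33303", "3131f3a7f6f33303", "edc939781b2e2e0f"]

check_similar = lambda s1, s2: sum(1 for a, b in zip(s1, s2) if a != b)
matches = [s for s in combinations(hashes, 2) if check_similar(*s) < 2]
print(matches)  # [('3131f3aff6f33303', '3131f3a7f6f33303')]
```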

Then of the similar matches, just replace one with the other in a loop:

for (match1, match2) in matches:
    df_.thumbnail_hash = df_.thumbnail_hash.replace(match1, match2)

And then you can apply your regular groupby deduplication:

df_ = df_.groupby("thumbnail_hash", sort=False, as_index=False).agg(
    {
        "pk": "first",
        "affiliate_url": lambda x: " ".join(x.dropna()),
        "site": lambda x: " ".join(x.dropna()),
    }
)
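Putting the pieces together on the question's input data (a sketch; column names as in the question):

```python
from itertools import combinations

import pandas as pd

df_ = pd.DataFrame(
    {
        "pk": ["1", "2", "3", "1", "3"],
        "thumbnail_hash": [
            "3131f3aff6f33303", "3131f3a7f6f33303",
            "edc939781b2e2e0f", "3131f3aff6f33303", "edc939781b2e2e0f",
        ],
        "affiliate_url": ["url1", "url2", "url3", "url1 url2", "url3"],
        "site": ["site1", "site2", "site2", "site1 site2", "site2"],
    }
)

# Number of positions at which two equal-length strings differ.
check_similar = lambda s1, s2: sum(1 for a, b in zip(s1, s2) if a != b)
matches = [s for s in combinations(df_.thumbnail_hash.unique(), 2) if check_similar(*s) < 2]

# Canonicalise near-duplicate hashes so groupby treats them as one group.
for match1, match2 in matches:
    df_.thumbnail_hash = df_.thumbnail_hash.replace(match1, match2)

result = df_.groupby("thumbnail_hash", sort=False, as_index=False).agg(
    {
        "pk": "first",
        "affiliate_url": lambda x: " ".join(x.dropna()),
        "site": lambda x: " ".join(x.dropna()),
    }
)
print(result)  # two rows remain, one per merged hash group
```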