Best way to recognize same club names that are written in a different way

Time:01-13

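    from difflib import SequenceMatcher

    # Compare every pair of fixtures; if both club names look similar,
    # keep the best odds in row x and drop the duplicate row y.
    for x in range(len(fclub1)-1):
        for y in range(x + 1, len(fclub1)-1):
            if SequenceMatcher(None, fclub1[x], fclub1[y]).ratio() > 0.4:
                if SequenceMatcher(None, fclub2[x], fclub2[y]).ratio() > 0.4: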
                    if float(fbest_odds_1[x]) < float(fbest_odds_1[y]):
                        fbest_odds_1[x] = fbest_odds_1[y]
                    if float(fbest_odds_x[x]) < float(fbest_odds_x[y]):
                        fbest_odds_x[x] = fbest_odds_x[y]
                    if float(fbest_odds_2[x]) < float(fbest_odds_2[y]):
                        fbest_odds_2[x] = fbest_odds_2[y]
                    fclub1.pop(y)
                    fclub2.pop(y)
                    fbest_odds_1.pop(y)
                    fbest_odds_x.pop(y)
                    fbest_odds_2.pop(y)

It can't reliably match club names from different bookmakers, for example Manchester United and Man. Utd.

I tried fixing it with SequenceMatcher so that it would recognize at least part of the club name, but then it started treating different clubs as the same: Aston Villa - Atherton Collieries and Leeds - Liversedge.
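
To make the false positives concrete, here is a minimal sketch that scores the pairs mentioned above with difflib. The helper name `similarity` is only for illustration and is not part of the original code; the 0.4 threshold is the one from the loop above.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Same measure the loop above uses: SequenceMatcher.ratio() on raw names.
    return SequenceMatcher(None, a, b).ratio()

pairs = [
    ("Manchester United", "Man. Utd."),       # same club, should match
    ("Aston Villa", "Atherton Collieries"),   # different clubs
    ("Leeds", "Liversedge"),                  # different clubs
]

for a, b in pairs:
    score = similarity(a, b)
    verdict = "match" if score > 0.4 else "no match"
    print(f"{a!r} vs {b!r}: {score:.3f} -> {verdict}")
```

All three pairs clear the 0.4 threshold, so a single cutoff on the raw ratio cannot separate the real abbreviation from the unrelated clubs.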

CodePudding user response:

The best solution is probably the most boring one: make a list of commonly used names for each team and use that, such as:

def standardize_team_name(name):
    if name in ["Manchester", "Manchester United", "Man. Utd."]:
        return "Manchester United"
    elif name in ["Aston Villa", ...]:
        return "Aston Villa"
    elif ...
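
The same idea can be sketched as a lookup table instead of a chain of if/elif branches. The alias lists below are made-up examples, not a real data set, and `_normalize` and `ALIAS_TO_CANONICAL` are hypothetical helper names. Unknown names are returned unchanged so they can be spotted and added to the table later.

```python
# Hypothetical alias table: canonical name -> spellings seen at bookmakers.
TEAM_ALIASES = {
    "Manchester United": ["Manchester United", "Man. Utd.", "Man Utd"],
    "Aston Villa": ["Aston Villa", "A. Villa"],
}

def _normalize(name):
    # Treat dots as spaces, lowercase, and collapse repeated whitespace.
    return " ".join(name.replace(".", " ").lower().split())

# Reverse index: normalized alias -> canonical name.
ALIAS_TO_CANONICAL = {
    _normalize(alias): canonical
    for canonical, aliases in TEAM_ALIASES.items()
    for alias in aliases
}

def standardize_team_name(name):
    # Fall back to the input when the name is not in the table.
    return ALIAS_TO_CANONICAL.get(_normalize(name), name)

print(standardize_team_name("Man. Utd."))
print(standardize_team_name("Leeds"))
```

With a dictionary, adding a new spelling is a one-line data change rather than a code change.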

CodePudding user response:

Something that would address the particular case you mention would be:

def is_abbr(shortened, full):
   short_words = shortened.replace('.', '').split(' ')
   full_words = full.split(' ')
   match = zip(short_words, full_words)
   return all(is_subseq(short, full) for short, full in match)

One implementation of is_subseq is the following, taken from a source linked in the original answer:

def is_subseq(x, y):
   it = iter(y)
   return all(any(c == ch for c in it) for ch in x)
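
As a quick check, the two functions can be run together on the names from the question. This sketch only repeats the definitions above so that it runs on its own. The last call is an added illustration of a limitation: zip stops at the shorter word list, so a one-word abbreviation is compared against the first word only.

```python
def is_subseq(x, y):
    # True if the characters of x appear in y in the same order.
    it = iter(y)
    return all(any(c == ch for c in it) for ch in x)

def is_abbr(shortened, full):
    short_words = shortened.replace('.', '').split(' ')
    full_words = full.split(' ')
    # zip pairs words by position and ignores any extra words.
    return all(is_subseq(s, f) for s, f in zip(short_words, full_words))

print(is_abbr("Man. Utd.", "Manchester United"))      # True
print(is_abbr("Leeds", "Liversedge"))                 # False
print(is_abbr("Aston Villa", "Atherton Collieries"))  # False
print(is_abbr("Man.", "Manchester City"))             # True (only first word compared)
```

The comparison is case-sensitive, so lowercasing both names first may be worth considering.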

You could also look for better string comparison modules. Or build up a lookup table for abbreviations. If your program has access to the internet, you could also just do a Google search for each name and see what comes up, and write some code to process the result to figure out what football club it is.

CodePudding user response:

I ended up using the fuzzywuzzy library and its fuzz.partial_ratio() function.
