Count unique values, ignore the spelling-CodePudding

I have a problem. I have the following dataframe. I want to count all the unique values. As you can see the problem is, that some of the words are uppercase or lowercase but are compleately the same thing i want to count. So in my case "Wifi" and "wifi" should be counted as 2. Same for the others. Is there a way i can do that by for example ignore the upper and lower case? And as you can see there are different writings for wifi (for example "Wifi 230 mb/s") is there a way to count the wifis when wifi is in the string?

d = {'host_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
     'value': ['Hot Water', 'Wifi', 'Kitchen',
               'Wifi', 'Hot Water',
               'Coffe Maker', 'wifi', 'hot Water', 'Wifi 230 mb/s']}
df = pd.DataFrame(data=d)

print(df)


print(len(df[df['value'].str.contains("Wifi", case=False)]))
print(df['value'].unique())
print(len(df['value'].unique()))

[out]
   host_id        value
0        1    Hot Water
1        1         Wifi
2        1      Kitchen
3        2         Wifi
4        2    Hot Water
5        3  Coffe Maker
6        3         wifi
7        3    hot Water
8        3    Wifi 230 mb/s

4 # count wifi

['Hot Water' 'Wifi' 'Kitchen' 'Coffe Maker' 'wifi' 'hot Water'] # unique values
6 # len unique values

What [out] should look like:

        value     count
0   Hot Water         3
1        Wifi         4
2     Kitchen         1
3 Coffe Maker         1

CodePudding user response：

If there is problem only with wifi - possible another substrings use:

df['value'] = (df['value'].mask(df['value'].str.contains("Wifi", case=False), 'wifi')
                          .str.title())
print (df)
   host_id        value
0        1    Hot Water
1        1         Wifi
2        1      Kitchen
3        2         Wifi
4        2    Hot Water
5        3  Coffe Maker
6        3         Wifi
7        3    Hot Water
8        3         Wifi

print(df['value'].value_counts())
Wifi           4
Hot Water      3
Kitchen        1
Coffe Maker    1
Name: value, dtype: int64


print(df.groupby('value', sort=False).size().reset_index(name='count'))
         value  count
0    Hot Water      3
1         Wifi      4
2      Kitchen      1
3  Coffe Maker      1

EDIT:

#counts original values wit hconvert to uppercase first latters
s = df['value'].str.title().value_counts()
print (s)
Wifi             3
Hot Water        3
Wifi 230 Mb/S    1
Kitchen          1
Coffe Maker      1
Name: value, dtype: int64

#filter if counts greater like N
N = 2
good = s.index[s.gt(N)]       
print (good)
Index(['Wifi', 'Hot Water'], dtype='object')

#extract values by list good

import re

pat = '|'.join(r"\b{}\b".format(x) for x in good)
df['new'] = df['value'].str.extract(rf'({pat})', expand=False, flags=re.I).str.title()
print (df)
   host_id          value        new
0        1      Hot Water  Hot Water
1        1           Wifi       Wifi
2        1        Kitchen        NaN
3        2           Wifi       Wifi
4        2      Hot Water  Hot Water
5        3    Coffe Maker        NaN
6        3           wifi       Wifi
7        3      hot Water  Hot Water
8        3  Wifi 230 mb/s       Wifi

df1 = df.groupby('new', sort=False).size().reset_index(name='count')
print (df1)
         new  count
0  Hot Water      3
1       Wifi      4


#get values not matched to good list (working if no NaNs in original column)    
df2 = df[df['new'].isna()].groupby('value', sort=False).size().reset_index(name='count')
print (df2)
         value  count
0      Kitchen      1
1  Coffe Maker      1

If need both:

df = pd.concat([df1, df2], ignore_index=True)

CodePudding user response：

Use str.lower() or str.upper() method on your list before comparing them. That should eliminate duplicates. If you would like to eliminate typos or other similar strings you can use python-Levenshtein to calculate distance and set 'cut off point' https://pypi.org/project/python-Levenshtein/