How do I count variables when they have slightly different way of spelling?-CodePudding

I have a problem counting in R. The each variable has a slightly different spelling like it shows down below

df\<-data.frame(sweets= c("cookie", "CANDY", "Cookie", "cake", "IceCream", "Candy", "Chocolate COOKIE", "COOKIE", "CAKE", "Chocolate cake", "candy bar"))
df

I want to be able to count by categories like this below. to do that, I want to change the each variable names to be consistent.

df2<-data.frame(sweets= c("Cookie", "Candy", "Cookie", "Cake", "IceCream", "Candy", "Cookie", "Cookie", "Cake", "Cake", "Candy"))               
df3<- table(df2)

CodePudding user response：

table(df2$sweets[max.col(-t(adist(df2$sweets, 
                                  df1$sweets, 
                                  partial = TRUE, 
                                  ignore.case = TRUE)))])
    
Cake    Candy   Cookie IceCream 
3        3        4        1

CodePudding user response：

EDIT: toupper & fuzzyjoin method

If I'm understanding correctly, here is a possible idea that involves joining df1 with the desired values (df2) using fuzzjoin.

I also used toupper to uniform capital letters.

NOTE: fuzzy joins are tricky. Some "alterations" of these words might be to different to be accurately matched.

It does seem to work for the example provided though. More info on fuzzyjoin here.

library(fuzzyjoin)
library(dplyr)

sweets = c("cookie", "CANDY", "Cookie", "cake", "IceCream", "Candy", "Chocolate COOKIE", "COOKIE", "CAKE", "Chocolate cake", "candy bar")
df1<-data.frame(sweets = sweets)

# add UPPER_sweets
df1 <- df1 %>% 
  mutate(sweets_upper = toupper(sweets))


df2<-data.frame(sweets = c("Cookie", "Candy","Cake", "IceCream"))
df2 <- df2 %>% 
  mutate(sweets = toupper(sweets))

#perform fuzzy matching left join
stringdist_join(df1 , 
                df2, 
                method = 'jaccard',
                by=c("sweets_upper"="sweets"), 
                mode='left', 
                max_dist=99, 
                distance_col='dist') %>%
  group_by(sweets_upper) %>%
  filter(dist == min(dist))

Output:

# A tibble: 11 × 4
# Groups:   sweets_upper [7]
   sweets.x         sweets_upper     sweets.y  dist
   <chr>            <chr>            <chr>    <dbl>
 1 cookie           COOKIE           COOKIE   0    
 2 CANDY            CANDY            CANDY    0    
 3 Cookie           COOKIE           COOKIE   0    
 4 cake             CAKE             CAKE     0    
 5 IceCream         ICECREAM         ICECREAM 0    
 6 Candy            CANDY            CANDY    0    
 7 Chocolate COOKIE CHOCOLATE COOKIE COOKIE   0.5  
 8 COOKIE           COOKIE           COOKIE   0    
 9 CAKE             CAKE             CAKE     0    
10 Chocolate cake   CHOCOLATE CAKE   CAKE     0.556
11 candy bar        CANDY BAR        CANDY    0.375

Then, to get the count...

stringdist_join(df1 , 
                df2, 
                method = 'jaccard',
                by=c("sweets_upper"="sweets"), 
                mode='left', 
                max_dist=99, 
                distance_col='dist') %>%
  group_by(sweets_upper) %>%
  filter(dist == min(dist)) %>% 
  group_by(sweets.y) %>% 
  summarise(count = n())

Output:

# A tibble: 4 × 2
  sweets.y count
  <chr>    <int>
1 CAKE         3
2 CANDY        3
3 COOKIE       4
4 ICECREAM     1

CodePudding user response：

Sounds like regular expressions would help.

grep("candy", list_of_names, ignore.case = TRUE) will return all indexes with "candy", regardless of upper/lowercase. You can generalize this for each category you are interested in.