I have a problem counting in R. The each variable has a slightly different spelling like it shows down below
df\<-data.frame(sweets= c("cookie", "CANDY", "Cookie", "cake", "IceCream", "Candy", "Chocolate COOKIE", "COOKIE", "CAKE", "Chocolate cake", "candy bar"))
df
I want to be able to count by categories like this below. to do that, I want to change the each variable names to be consistent.
df2<-data.frame(sweets= c("Cookie", "Candy", "Cookie", "Cake", "IceCream", "Candy", "Cookie", "Cookie", "Cake", "Cake", "Candy"))
df3<- table(df2)
CodePudding user response:
table(df2$sweets[max.col(-t(adist(df2$sweets,
df1$sweets,
partial = TRUE,
ignore.case = TRUE)))])
Cake Candy Cookie IceCream
3 3 4 1
CodePudding user response:
EDIT: toupper & fuzzyjoin method
If I'm understanding correctly, here is a possible idea that involves joining df1 with the desired values (df2) using fuzzjoin
.
I also used toupper
to uniform capital letters.
NOTE: fuzzy joins are tricky. Some "alterations" of these words might be to different to be accurately matched.
It does seem to work for the example provided though. More info on fuzzyjoin here.
library(fuzzyjoin)
library(dplyr)
sweets = c("cookie", "CANDY", "Cookie", "cake", "IceCream", "Candy", "Chocolate COOKIE", "COOKIE", "CAKE", "Chocolate cake", "candy bar")
df1<-data.frame(sweets = sweets)
# add UPPER_sweets
df1 <- df1 %>%
mutate(sweets_upper = toupper(sweets))
df2<-data.frame(sweets = c("Cookie", "Candy","Cake", "IceCream"))
df2 <- df2 %>%
mutate(sweets = toupper(sweets))
#perform fuzzy matching left join
stringdist_join(df1 ,
df2,
method = 'jaccard',
by=c("sweets_upper"="sweets"),
mode='left',
max_dist=99,
distance_col='dist') %>%
group_by(sweets_upper) %>%
filter(dist == min(dist))
Output:
# A tibble: 11 × 4
# Groups: sweets_upper [7]
sweets.x sweets_upper sweets.y dist
<chr> <chr> <chr> <dbl>
1 cookie COOKIE COOKIE 0
2 CANDY CANDY CANDY 0
3 Cookie COOKIE COOKIE 0
4 cake CAKE CAKE 0
5 IceCream ICECREAM ICECREAM 0
6 Candy CANDY CANDY 0
7 Chocolate COOKIE CHOCOLATE COOKIE COOKIE 0.5
8 COOKIE COOKIE COOKIE 0
9 CAKE CAKE CAKE 0
10 Chocolate cake CHOCOLATE CAKE CAKE 0.556
11 candy bar CANDY BAR CANDY 0.375
Then, to get the count...
stringdist_join(df1 ,
df2,
method = 'jaccard',
by=c("sweets_upper"="sweets"),
mode='left',
max_dist=99,
distance_col='dist') %>%
group_by(sweets_upper) %>%
filter(dist == min(dist)) %>%
group_by(sweets.y) %>%
summarise(count = n())
Output:
# A tibble: 4 × 2
sweets.y count
<chr> <int>
1 CAKE 3
2 CANDY 3
3 COOKIE 4
4 ICECREAM 1
CodePudding user response:
Sounds like regular expressions would help.
grep("candy", list_of_names, ignore.case = TRUE)
will return all indexes with "candy", regardless of upper/lowercase. You can generalize this for each category you are interested in.