I'm listing all countries where "Cocoa Percent" > 70% or "Rating" > 3.5, using ggplot2 in RStudio. However, the plot shows some countries that do not match the criteria, such as South Korea (70% cocoa, rating 3.25), Netherlands (70% cocoa, rating 3.5), Russia, Suriname, etc. There should be 51 countries, as I checked with Excel's advanced filter, but the ggplot shows 56.
This is my code chunk:
chocolate_df %>%
  filter(`Cocoa\nPercent` > 70 | Rating > 3.5) %>%
  ggplot(aes(x = `Company\nLocation`)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90))
Answer:
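The issue is that the variable `Cocoa\nPercent` was read into R as a character variable, including the % symbol. When a character value is compared with a number, R converts the number to a string and compares the two as text, so "70%" > 70 evaluates to TRUE and the 70% bars slip through the filter. You need to convert the column to a numeric variable first.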
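To see the effect in isolation, here is a small sketch using a few made-up percentage strings (the vector `pct_chr` is illustrative, not taken from the dataset file). It compares the raw strings with 70, then repeats the comparison after stripping the % sign and converting to numeric. The exact ordering of strings can depend on your locale, so treat the character result as typical rather than guaranteed.

```r
# Illustrative values only; they mimic how the column looks after import
pct_chr <- c("63%", "70%", "75%", "100%")

# Character vs. number: 70 is coerced to "70" and the comparison is done as text
chr_result <- pct_chr > 70

# Strip the literal % sign, convert to numeric, then compare as numbers
pct_num <- as.numeric(sub("%", "", pct_chr, fixed = TRUE))
num_result <- pct_num > 70

# Side-by-side view of both comparisons
print(data.frame(pct_chr, chr_result, pct_num, num_result))
```

In the text comparison, "100%" is treated as smaller than "70" because "1" sorts before "7", while "70%" usually counts as larger than "70". Only the numeric comparison matches the intended rule.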
Here's the same dataset from a GitHub repository:
library(readr)
library(ggplot2)
library(dplyr)
cacao <- read_csv("https://raw.githubusercontent.com/ry05/Chocolate-Bar-Analysis/master/Dataset/flavors_of_cacao.csv")
glimpse(cacao, width = 100)
Rows: 1,795
Columns: 9
$ `Company \n(Maker-if known)` <chr> "A. Morin", "A. Morin", "A. Morin", "A. Morin", "A. Mo…
$ `Specific Bean Origin\nor Bar Name` <chr> "Agua Grande", "Kpime", "Atsane", "Akata", "Quilla", "…
$ REF <dbl> 1876, 1676, 1676, 1680, 1704, 1315, 1315, 1315, 1319, …
$ `Review\nDate` <dbl> 2016, 2015, 2015, 2015, 2015, 2014, 2014, 2014, 2014, …
$ `Cocoa\nPercent` <chr> "63%", "70%", "70%", "70%", "70%", "70%", "70%", "70%"…
$ `Company\nLocation` <chr> "France", "France", "France", "France", "France", "Fra…
$ Rating <dbl> 3.75, 2.75, 3.00, 3.50, 3.50, 2.75, 3.50, 3.50, 3.75, …
$ `Bean\nType` <chr> " ", " ", " ", " ", " ", "Criollo", " ", "Criollo", "C…
$ `Broad Bean\nOrigin` <chr> "Sao Tome", "Togo", "Togo", "Togo", "Peru", "Venezuela…
With your filter as written, there are 56 distinct company locations:
cacao %>%
  filter(`Cocoa\nPercent` > 70 | Rating > 3.5) %>%
  distinct(`Company\nLocation`) %>%
  nrow()
[1] 56
After converting the column to numeric, there are 51 distinct company locations:
cacao %>%
  mutate(`Cocoa\nPercent` = as.numeric(gsub("%", "", `Cocoa\nPercent`))) %>%
  filter(`Cocoa\nPercent` > 70 | Rating > 3.5) %>%
  distinct(`Company\nLocation`) %>%
  nrow()
[1] 51
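Putting the conversion together with the original bar chart might look like the sketch below. It uses `readr::parse_number()` as an alternative to the `gsub()` approach, since it drops non-numeric characters such as %. The helper name `filter_bars`, the new column `cocoa_pct`, and the toy rows are assumptions made for illustration; the toy rows reuse the South Korea and Netherlands values from the question so the example runs without downloading anything.

```r
library(dplyr)
library(ggplot2)
library(readr)

# Hypothetical helper: keep bars with more than 70% cocoa or a rating above 3.5
filter_bars <- function(df) {
  df %>%
    mutate(cocoa_pct = parse_number(`Cocoa\nPercent`)) %>%  # "70%" becomes 70
    filter(cocoa_pct > 70 | Rating > 3.5)
}

# Toy rows (illustrative, not the real file) with the same column names
toy <- tibble::tibble(
  `Cocoa\nPercent`    = c("70%", "72%", "100%", "70%"),
  `Company\nLocation` = c("South Korea", "France", "France", "Netherlands"),
  Rating              = c(3.25, 3.00, 2.00, 3.50)
)

kept <- filter_bars(toy)

# Same plot as in the question, with the layers joined by +
p <- ggplot(kept, aes(x = `Company\nLocation`)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90))
print(p)
```

With the real data, calling `filter_bars(cacao)` (or `filter_bars(chocolate_df)`) in place of `filter_bars(toy)` should give the plot restricted to the 51 matching locations.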