I use R for text mining and want to count certain strings in a data frame column. They appear in the text like this:
- "conducteur(trice)" , "conducteur.trice"
- "administratif(ve)" , "administratif.ve" , "administrati.ve"
- "agent(e)"
My code is:
data <- data %>%
  mutate(Description = tolower(Description),
         ve.count = str_count(Description, "[i].ve[ ]"),
         e.count = str_count(Description, "(e)"),
         trice.count = str_count(Description, "(trice)"))
I want to count these suffixes: .ve / (ve) / (ive) / .e / (e) / .trice / (trice)
My code doesn't detect what I want. Any help?
CodePudding user response:
Does this help?
library(tidyverse)

data <- tibble(
  Description = c(
    "conducteur(trice) or conducteur.trice",
    "administratif(ve) , administratif.ve or administrati.ve",
    "agent(e)"
  )
)

data %>%
  mutate(
    # count "ve" inside the first pair of parentheses
    ve.count = Description %>% str_extract("[(][^()]+[)]") %>% str_count("ve")
  )
#> # A tibble: 3 × 2
#>   Description                                             ve.count
#>   <chr>                                                      <int>
#> 1 conducteur(trice) or conducteur.trice                          0
#> 2 administratif(ve) , administratif.ve or administrati.ve        1
#> 3 agent(e)                                                       0
Created on 2022-05-09 by the reprex package (v2.0.0)
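As a side note on why the patterns in the question miss: in a regular expression, unescaped parentheses form a capture group, so "(e)" simply matches every "e", and an unescaped "." matches any character. A minimal sketch of the difference, using only stringr and an example string taken from the question (the variable names are illustrative):

```r
library(stringr)

x <- "agent(e)"

# Unescaped: "(e)" is a capture group around "e", so every "e" is counted.
unescaped <- str_count(x, "(e)")

# Escaped parentheses match the literal characters "(" and ")".
escaped <- str_count(x, "\\(e\\)")

# fixed() treats the whole pattern as a literal string, so no escaping is needed.
literal <- str_count(x, fixed("(e)"))

print(c(unescaped = unescaped, escaped = escaped, literal = literal))
```

Here the unescaped pattern counts both the "e" in "agent" and the one inside the parentheses, while the escaped and fixed() versions count only the literal "(e)".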
CodePudding user response:
@danlooo, I tried this "[\(\.] [\) \.\,]" and it worked for me.
So it gives:
data <- data %>%
  mutate(Description = tolower(Description),
         ve.count = str_count(Description, "[\\(\\.]ve[\\) \\.\\,]"),
         e.count = str_count(Description, "[\\(\\.]e[\\) \\.\\,]"),
         trice.count = str_count(Description, "[\\(\\.]trice[\\) \\.\\,]"))
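One caveat with the patterns above: they require a closing character after the suffix, so a suffix at the very end of a string (such as the final "administrati.ve") is not counted, and "(ive)" from the question is not covered. A possible variant, as a sketch only: it uses a lookahead that also accepts the end of the string, and an optional "i" before "ve". The helper name count_suffix is hypothetical and not from this thread; the sample data is the one from the earlier answer.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: counts `suffix` when it follows "(" or "." and is
# followed by ")", a space, ".", "," or the end of the string.
count_suffix <- function(text, suffix) {
  str_count(text, paste0("[(.]", suffix, "(?=[) .,]|$)"))
}

data <- tibble(
  Description = c(
    "conducteur(trice) or conducteur.trice",
    "administratif(ve) , administratif.ve or administrati.ve",
    "agent(e)"
  )
)

result <- data %>%
  mutate(
    Description = tolower(Description),
    ve.count    = count_suffix(Description, "i?ve"),  # .ve, (ve), .ive, (ive)
    e.count     = count_suffix(Description, "e"),     # .e, (e)
    trice.count = count_suffix(Description, "trice")  # .trice, (trice)
  )

print(result)
```

The lookahead does not consume the closing character, so adjacent occurrences are still counted separately. Note that the ".e" form can also match unrelated text such as "i.e.", so it may need tightening for real descriptions.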