I have a data frame with two columns, like this:
USER ID | text |
---|---|
1 | "..." |
2 | "..." |
. | . |
. | . |
. | . |
100 | "..." |
Let's say there are 100 users and each user has one text.
I want to compute the proportion of texts that contain a question mark. For example, if only 20 of the texts contain question marks, the value I should get is 20/100 (I don't care how many question marks there are within each text).
I tried to use str_count() and build a loop for it:
for (i in 1:length(data_frame$text)) {
str_count(data_frame$text[i], pattern = "\\?")}
but it's just not working; it doesn't even produce an error.
CodePudding user response:
If you want to check whether each string contains a question mark (coded as 1/0), you could do this in base R:
df <- data.frame(id = 1:10,
text = c(LETTERS[1:5], paste0(LETTERS[1:5],"?")))
df$question_mark <- grepl("\\?", df$text)*1
You can find the proportion by:
sum(df$question_mark) / nrow(df)
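As a possible shortcut: since R treats TRUE/FALSE as 1/0 in arithmetic, taking the mean of the logical vector should give the same proportion in one step. A minimal sketch, reusing the example data from above; `has_q` and `prop` are names picked here purely for illustration:

```r
df <- data.frame(id = 1:10,
                 text = c(LETTERS[1:5], paste0(LETTERS[1:5], "?")))

# fixed = TRUE matches a literal "?", so no regex escaping is needed
has_q <- grepl("?", df$text, fixed = TRUE)

# The mean of a logical vector is the share of TRUE values
prop <- mean(has_q)
print(prop)
# [1] 0.5
```

This skips the intermediate 1/0 column; keep the column if you need the per-row flag later.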
CodePudding user response:
You may want to use stringr::str_detect(); you do not need a for loop.
Most of the str_* functions are vectorized, which is one of R's core strengths. (Under the hood it is still a loop, of course, but it runs in compiled code (C/C++), so it is much faster as well as easier to write.)
Consider:
df <- data.frame(test = c("asa", "asa?", "asa??", "asa???", "asa??"))
# count the texts containing at least one "?" and format as "matches/total"
result <- paste0(sum(stringr::str_detect(df$test, "\\?")), "/", length(df$test))
print(result)
# [1] "4/5"
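As for why the loop in the question seemed to do nothing: R only auto-prints values at the top level, so the result of str_count() inside a for loop is computed and then silently discarded. That is why there was no output and no error. A rough sketch of both the repaired loop and the vectorized version, assuming stringr is installed; `texts`, `counts`, and `counts_vec` are illustrative names, not from the question:

```r
library(stringr)  # assumes stringr is installed

texts <- c("asa", "asa?", "asa??", "asa???", "asa??")

# Inside a for loop nothing is auto-printed, so store each result
# (or wrap the call in print()).
counts <- integer(length(texts))
for (i in seq_along(texts)) {
  counts[i] <- str_count(texts[i], pattern = "\\?")
}

# Vectorized equivalent: one call over the whole vector, no loop
counts_vec <- str_count(texts, pattern = "\\?")

# Proportion of texts with at least one question mark
prop <- mean(counts_vec > 0)
print(prop)
# [1] 0.8
```

The loop version is only there to show what went wrong; the vectorized call is the one you would normally use.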