Let's say I have a txt file that looks like this:
aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww
How do I apply str_split() in this case?
I am trying to count how many times the letter "a" shows up in this txt file.
This is what I have so far:
str_split(mytxtfile, "")
Then I ran str_length(str_detect(mytxtfile, "a")).
The result was less than 9, so I believe I did something wrong when using str_split().
Please help!
CodePudding user response:
A base R option is to use lengths with regmatches and gregexpr to count a character in a string, like this:
input = 'aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww'
lengths(regmatches(input, gregexpr("a", input)))
#> [1] 21
Created on 2022-12-09 with reprex v2.0.2
@jay.sf is right! regmatches is not necessary here, since the string contains at least one match:
input = 'aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww'
lengths(gregexpr("a", input))
#> [1] 21
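One caveat with dropping regmatches: when there is no match at all, gregexpr returns a single -1, so lengths() reports 1 rather than 0. Below is a minimal sketch of that edge case; count_char is a hypothetical helper name of mine, not something from base R, and fixed = TRUE is my own choice so the character is treated literally.

```r
# Hypothetical helper: count occurrences of the literal string `ch` in `x`.
count_char <- function(x, ch) {
  # fixed = TRUE treats `ch` as a literal string, not a regular expression
  m <- gregexpr(ch, x, fixed = TRUE)
  # regmatches() yields character(0) when nothing matched,
  # so lengths() gives 0 in that case
  lengths(regmatches(x, m))
}

count_char("aaabbb", "a")
#> [1] 3
count_char("aaabbb", "z")
#> [1] 0

# Without regmatches(), the "no match" marker -1 is counted as one element
lengths(gregexpr("z", "aaabbb", fixed = TRUE))
#> [1] 1
```

So the shorter form is fine when you know the character occurs; keeping regmatches is the safer choice otherwise.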
CodePudding user response:
You can use str_count to count inside of a string:
library(stringr)
mytxtfile = "aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww"
str_count(mytxtfile, "a")
#[1] 21
Using the str_split_* functions, you can combine str_split_1 (stringr 1.5.0 required), str_detect, and sum (although, as you can see, str_count is much simpler):
str_split_1(mytxtfile, "") |>
  str_detect("a") |>
  sum()
#[1] 21
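As for why the attempt in the question gave a small number, here is a rough sketch using base R analogues. I am assuming grepl behaves like str_detect, nchar like str_length, and strsplit like str_split for this purpose; the short input string is made up for illustration.

```r
x <- "aaabbb"

# grepl() answers "does the string contain an a?" once for the whole string
found <- grepl("a", x)
found
#> [1] TRUE

# nchar() then measures the text "TRUE", which is 4 characters long,
# so the result has nothing to do with how many a's there are
nchar(found)
#> [1] 4

# Splitting first gives one element per character,
# so the comparison runs once per letter and sum() counts the hits
chars <- strsplit(x, "")[[1]]
sum(chars == "a")
#> [1] 3
```

Note also that the result of the split has to be passed on to the next call; calling str_split() on its own line and then working on mytxtfile again discards the split.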
CodePudding user response:
Adding to the options, removing all but a's and counting the chars:
nchar(gsub("[^a]", "", mytxtfile))
Output:
[1] 21
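If the character to count might be a regex metacharacter (for example "." or "^"), a variation on the same idea is to delete the character itself with fixed = TRUE and compare lengths. This is only a sketch; count_literal is a hypothetical name I made up, and the sample strings are invented.

```r
# Hypothetical helper: count literal occurrences of `ch` in `x`
# by removing them and measuring how much shorter the string got.
count_literal <- function(x, ch) {
  # fixed = TRUE: `ch` is matched literally, so "." means a dot
  removed <- gsub(ch, "", x, fixed = TRUE)
  # dividing by nchar(ch) keeps the count right for multi-character `ch`
  (nchar(x) - nchar(removed)) / nchar(ch)
}

count_literal("banana", "a")
#> [1] 3
count_literal("a.b.c.", ".")
#> [1] 3
```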
CodePudding user response:
In case you want to count the number of times a character is repeated in sequence, not just for one character but for all characters in the string, here's a tidyverse solution:
library(tidyverse)
data.frame(strng) %>%
  mutate(
    # 1. split `strng` any time next char is not the same as prior char:
    sameChar = str_split(strng, "(?<=(.))(?!\\1|$)"),
    # 2. count number of chars in each element of list:
    sameCharN = lapply(sameChar, function(x) nchar(x)))
strng
1 aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww
sameChar
1 aaaaaaaaaaaaaaaa, ddddddddddddd, ssssssssss, bbbbbbbbbbbbb, cccccccccc, xxxxxxxxxx, s, dddddd, aaa, s, d, a, d, vvvvvvvv, bbbbbbb, a, xxxx, nnnnnnnnn, wwwwwwwwwwwwww
sameCharN
1 16, 13, 10, 13, 10, 10, 1, 6, 3, 1, 1, 1, 1, 8, 7, 1, 4, 9, 14
Here's how the str_split regex works:
(?<=(.)): positive look-behind that captures, as a group (in parentheses), any character (.) occurring immediately to the left of the split point
(?!\\1|$): negative look-ahead asserting that the split must not (!) be performed if the immediately next char is the same (\\1) as the char immediately prior OR (|) the split point is the final position in the string ($)
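If you want to sanity-check the run lengths without the regex, base R's rle gives the same kind of result on a vector of single characters. A small sketch follows; the short string s is made up for illustration and is not the data from the question.

```r
s <- "aaabccdd"

# one element per character
chars <- strsplit(s, "")[[1]]

# rle() collapses consecutive equal values into (value, length) pairs
runs <- rle(chars)
runs$values
#> [1] "a" "b" "c" "d"
runs$lengths
#> [1] 3 1 2 2

# rebuild the run strings, comparable to the sameChar column
strrep(runs$values, runs$lengths)
#> [1] "aaa" "b"   "cc"  "dd"
```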
Data:
strng <- "aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww"