How to split a string in R from a txt file with lots of letters and no spaces?

Let's say I have a txt file that looks like this:

aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww

How do I apply str_split() in this case?

I am trying to count how many times the letter "a" shows up in this txt file.

This is what I have so far:

str_split(mytxtfile, "")

Then I ran str_length(str_detect(mytxtfile, "a")).

The function returned a number less than 9, so I believe I did something wrong when using str_split().

Please help!

CodePudding user response：

A base R option: use lengths with regmatches and gregexpr to count a character in a string, like this:

input = 'aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww'
lengths(regmatches(input, gregexpr("a", input)))
#> [1] 21

Created on 2022-12-09 with reprex v2.0.2

@jay.sf is right! regmatches is not necessary:

input = 'aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww'
lengths(gregexpr("a", input))
#> [1] 21
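
One caveat with the shorter form: gregexpr() reports "no match" as a single -1, so lengths(gregexpr(...)) returns 1 rather than 0 when the character never occurs. A minimal sketch of the difference, where no_a is a made-up input used only for illustration:

```r
no_a <- "bbbccc"  # hypothetical input that contains no "a"

# gregexpr() returns -1 for "no match", which still has length 1
short_form <- lengths(gregexpr("a", no_a))

# regmatches() turns "no match" into an empty vector, so the count is 0
with_regmatches <- lengths(regmatches(no_a, gregexpr("a", no_a)))

# Counting only the positive match positions keeps the shorter form safe
safe_short <- sum(gregexpr("a", no_a, fixed = TRUE)[[1]] > 0)
```

For the input in the question both forms agree, because "a" occurs at least once.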


CodePudding user response：

You can use str_count to count inside of a string:

library(stringr)
mytxtfile = "aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww"
str_count(mytxtfile, "a")
#[1] 21

With the str_split_* functions, you can combine str_split_1 (requires stringr 1.5.0), str_detect, and sum (although, as you can see, str_count is much simpler):

str_split_1(mytxtfile, "") |>
  str_detect("a") |>
  sum()
#[1] 21
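
As for why the attempt in the question returned a small number: str_split() returns a new value and does not modify mytxtfile, so the later call still sees the whole string. Detecting "a" in that single string gives one TRUE, and taking the string length of that value presumably measures the text "TRUE", which has 4 characters. The sketch below reproduces the effect with base R stand-ins (strsplit, grepl and nchar are substitutes chosen here, not the functions from the question) and then counts correctly:

```r
mytxtfile <- "aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww"

# Splitting returns a new object; mytxtfile itself is unchanged,
# so the result has to be stored to be used later
chars <- strsplit(mytxtfile, "")[[1]]

# Detecting on the whole string yields a single logical value
whole <- grepl("a", mytxtfile)

# nchar() coerces TRUE to the text "TRUE", hence a length of 4
wrong <- nchar(whole)

# Compare each split character with "a" and sum the TRUEs instead
right <- sum(chars == "a")

# table() counts every letter at once
all_counts <- table(chars)
```

Here wrong is 4 and right is 21, and all_counts[["a"]] gives the same 21 together with the counts for the other letters.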

CodePudding user response：

Adding to the options, removing all but a's and counting the chars:

nchar(gsub("[^a]", "", mytxtfile))

Output:

[1] 21

CodePudding user response：

If you want to count how many times each character repeats in sequence, not just for one character but for every character in the string, here's a tidyverse solution:

library(tidyverse)
data.frame(strng) %>%
  mutate(
    # 1. split `strng` any time next char is not the same as prior char:
    sameChar = str_split(strng, "(?<=(.))(?!\\1|$)"),
    # 2. count number of chars in each element of list:
    sameCharN = lapply(sameChar, nchar))

Output:

    strng
1 aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww
                                                                                                                                                               sameChar
1 aaaaaaaaaaaaaaaa, ddddddddddddd, ssssssssss, bbbbbbbbbbbbb, cccccccccc, xxxxxxxxxx, s, dddddd, aaa, s, d, a, d, vvvvvvvv, bbbbbbb, a, xxxx, nnnnnnnnn, wwwwwwwwwwwwww
                                                       sameCharN
1 16, 13, 10, 13, 10, 10, 1, 6, 3, 1, 1, 1, 1, 8, 7, 1, 4, 9, 14

Here's how the str_split regex works:

(?<=(.)): positive look-behind that captures the character (.) immediately to the left of the split point as a capture group (in parentheses)
(?!\\1|$): negative look-ahead asserting that the split must not (!) be performed if the immediately following char is the same (\\1) as the char immediately before, OR (|) if the split point is at the end of the string ($)
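
If the regex feels heavy, the same run lengths can probably be obtained without it. This is a swapped-in base R technique (run-length encoding with rle), not the approach of the answer above, and the names runs and run_table are made up for illustration:

```r
strng <- "aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww"

# Split into single characters, then run-length encode them:
# rle() records each run's value and how many times it repeats
runs <- rle(strsplit(strng, "")[[1]])

# One row per run: the repeated character and the length of the run
run_table <- data.frame(char = runs$values, n = runs$lengths)

# Total count of "a" across all of its runs
total_a <- sum(run_table$n[run_table$char == "a"])
```

run_table$n should match the sameCharN values shown above, and total_a is the overall count of "a".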

Data:

strng <- "aaaaaaaaaaaaaaaadddddddddddddssssssssssbbbbbbbbbbbbbccccccccccxxxxxxxxxxsddddddaaasdadvvvvvvvvbbbbbbbaxxxxnnnnnnnnnwwwwwwwwwwwwww"