K-mer words in R

I am still new to R and don't know how to translate the Python code below into R.

human_data is a data frame read from a CSV file. Its 'sequence' column holds strings of letters. I want to split each string into all of its overlapping k-mer words of length 6 and store them in a new 'words' column.

def getKmers(sequence, size=6):
    return [sequence[x:x+size] for x in range(len(sequence) - size + 1)]

human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
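
For reference, here is a minimal self-contained sketch of what the function is expected to return for a single string. The name get_kmers and the sample sequences are made up for illustration and are not taken from the real data:

```python
def get_kmers(sequence, size=6):
    """Return every overlapping substring of length `size`, left to right."""
    # A string of length n has n - size + 1 windows; range() is empty if that is <= 0.
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


# Hypothetical sample sequence (assumption, not from the CSV file)
kmers = get_kmers("ATGCATGCA")
print(kmers)  # ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']

# A sequence shorter than `size` yields no k-mers at all
print(get_kmers("ATG"))  # []
```

An R translation should probably reproduce both cases, including the empty result for sequences shorter than 6 characters.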

CodePudding user response：

You could also use the quanteda library to compute the k-mers (character n-grams). The following code shows an example:

library(quanteda)
k = 6 # 6-mers
human_data = data.frame(sequence=c('abcdefghijkl', 'xxxxyyxxyzz'))
human_data$words <- apply(human_data, 1, 
                          function(x) char_ngrams(unlist(tokens(x['sequence'], 
                                      'character')), n=k, concatenator = ''))
human_data
#      sequence                                                  words
#1 abcdefghijkl abcdef, bcdefg, cdefgh, defghi, efghij, fghijk, ghijkl
#2  xxxxyyxxyzz         xxxxyy, xxxyyx, xxyyxx, xyyxxy, yyxxyz, yxxyzz

CodePudding user response：

I hope this helps. Here is a version that uses only base R:

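df = data.frame(words=c('asfdklajsjahk', 'dkajsadjkfggfh', 'kfjlhdaDDDhlw'))

getKmers = function(sequence, size=6) {
    kmers = c()
    # Number of windows; max(..., 0) keeps the loop empty for short sequences
    for (x in seq_len(max(nchar(sequence) - size + 1, 0))) {
        # substr() is 1-based and includes both end positions
        kmers = c(kmers, substr(sequence, x, x + size - 1))
    }
    return(kmers)
}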

sapply(df$words, getKmers)