I am still new to R programming and have no idea how to translate the Python code below into R.
human_data is a data frame read from a CSV file, and its 'sequence' column holds strings of letters. I want to convert each string in that column into all possible overlapping k-mer words of length 6 and store them in a new 'words' column.
def getKmers(sequence, size=6):
    # Slide a window of length `size` over the string, one character at a time
    return [sequence[x:x + size] for x in range(len(sequence) - size + 1)]

human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
CodePudding user response:
You could also use the quanteda library to compute the k-mers (k-grams). The following code shows an example:
library(quanteda)
k = 6 # 6-mers
human_data = data.frame(sequence=c('abcdefghijkl', 'xxxxyyxxyzz'))
human_data$words <- apply(human_data, 1,
  function(x) char_ngrams(unlist(tokens(x['sequence'],
    'character')), n=k, concatenator = ''))
human_data
# sequence words
#1 abcdefghijkl abcdef, bcdefg, cdefgh, defghi, efghij, fghijk, ghijkl
#2 xxxxyyxxyzz xxxxyy, xxxyyx, xxyyxx, xyyxxy, yyxxyz, yxxyzz
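As a cross-check, the Python function from the question can be run on the same two sample strings. This is only a minimal sketch in plain Python (no pandas), and it assumes the slice is meant to be sequence[x:x + size] and the range len(sequence) - size + 1; the `sequences` list and `kmers` dict are names introduced here for illustration:

```python
def getKmers(sequence, size=6):
    """Return every overlapping substring of length `size`."""
    return [sequence[x:x + size] for x in range(len(sequence) - size + 1)]

# Same sample strings as in the R example above
sequences = ["abcdefghijkl", "xxxxyyxxyzz"]

# Map each input string to its list of 6-mers
kmers = {seq: getKmers(seq) for seq in sequences}

for seq, words in kmers.items():
    print(seq, ", ".join(words))
```

The printed k-mers should match the words column shown in the R output: seven 6-mers for the first string and six for the second. A string shorter than size gives an empty list, since the range is then empty.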
CodePudding user response:
I hope this helps. Here is a version using only base R commands:
df = data.frame(words=c('asfdklajsjahk', 'dkajsadjkfggfh', 'kfjlhdaDDDhlw'))
getKmers = function(sequence, size=6) {
  kmers = c()
  # seq_len() gives an empty loop when the string is shorter than size
  for (x in seq_len(max(nchar(sequence) - size + 1, 0))) {
    kmers = c(kmers, substr(sequence, x, x + size - 1))
  }
  return(kmers)
}
sapply(df$words, getKmers)