I'm looking for a simple and efficient way to extract the general pattern of how elements repeat in a given vector (to later compare various vectors). For example, this should make it possible to correct common problems where some people call an observed color "purple" while others call it "violet", etc. However, I need something very general and flexible; simple substitution of terms is not realistic, since I don't know in advance all the names that may come up.
Here is an example with 3 vectors: the 1st and 2nd have the same pattern, only the elements are named differently, whereas the 3rd is different:
aa=letters[rep(c(3:1,4),each=2)]
ab=letters[rep(c(5,8:6),each=2)]
ac=letters[c(1:2,1:3,3:4,4)]
I tried
as.numeric(factor(aa,labels=unique(aa)))
as.numeric(factor(ab,labels=unique(ab)))
but as you can see, the results do not show the same pattern for aa and ab (always 2 repeats before moving on to the next item).
Thanks in advance, Wolfgang
CodePudding user response:
You are looking for a run-length encoding function; in R it is called rle, which gives you the lengths and values of each run of consecutive repeated values:
rle(aa)$lengths
# [1] 2 2 2 2
rle(ab)$lengths
# [1] 2 2 2 2
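To compare whole patterns rather than reading the printed lengths by eye, the run lengths can be tested for equality. A minimal sketch in base R, reusing the vectors from the question; `same_runs` is a hypothetical helper name, not part of any package:

```r
aa <- letters[rep(c(3:1, 4), each = 2)]
ab <- letters[rep(c(5, 8:6), each = 2)]
ac <- letters[c(1:2, 1:3, 3:4, 4)]

# Hypothetical helper: TRUE when both vectors have identical run lengths
same_runs <- function(x, y) identical(rle(x)$lengths, rle(y)$lengths)

same_runs(aa, ab)  # TRUE:  both are four runs of length 2
same_runs(aa, ac)  # FALSE: ac has run lengths 1 1 1 1 2 2
```

Note that this only compares run lengths, so it ignores whether a value comes back later in the vector.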
Use data.table::rleid if you want to create an id along the values:
data.table::rleid(aa)
# [1] 1 1 2 2 3 3 4 4
data.table::rleid(ab)
# [1] 1 1 2 2 3 3 4 4
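Run ids only track consecutive repeats, so a value that reappears later gets a fresh id. If that distinction matters, one possible alternative (a sketch, not part of the answer above) is to number each element by order of first appearance using base R's `match` and `unique`; `pattern_id` is a hypothetical helper name:

```r
aa <- letters[rep(c(3:1, 4), each = 2)]
ab <- letters[rep(c(5, 8:6), each = 2)]
ac <- letters[c(1:2, 1:3, 3:4, 4)]

# Hypothetical helper: relabel values by the order in which they first appear
pattern_id <- function(x) match(x, unique(x))

pattern_id(aa)  # 1 1 2 2 3 3 4 4
pattern_id(ab)  # 1 1 2 2 3 3 4 4
pattern_id(ac)  # 1 2 1 2 3 3 4 4

# The labels differ, but the patterns of aa and ab are the same
identical(pattern_id(aa), pattern_id(ab))  # TRUE
```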
CodePudding user response:
aa=letters[rep(c(3:1,4),each=2)]
ab=letters[rep(c(5,8:6),each=2)]
ac=letters[c(1:2,1:3,3:4,4)]
Get universe of letters:
all_let<-c(aa,ab,ac)
Create vector of unique letters, and insert letters as name of each element:
uni_let<-unique(all_let)
names(uni_let)<-uni_let
Create code id for each letter:
uni_code<-seq_along(uni_let)
names(uni_code)<-names(uni_let)
Replace each letter by its code id, e.g.:
match(ac,names(uni_code))
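As a usage sketch, the same lookup can be applied to all three vectors at once. This is a compact restatement of the steps above in base R; `setNames` and the list name `codes` are choices made for this illustration, not part of the original answer:

```r
aa <- letters[rep(c(3:1, 4), each = 2)]
ab <- letters[rep(c(5, 8:6), each = 2)]
ac <- letters[c(1:2, 1:3, 3:4, 4)]

# One shared code table, built from the letters of all vectors
uni_let <- unique(c(aa, ab, ac))
uni_code <- setNames(seq_along(uni_let), uni_let)

# Look up every vector in the shared table
codes <- lapply(list(aa = aa, ab = ab, ac = ac),
                function(x) unname(uni_code[x]))

codes$aa  # 1 1 2 2 3 3 4 4
codes$ab  # 5 5 6 6 7 7 8 8
codes$ac  # 3 2 3 2 1 1 4 4
```

Because the code table is shared, the same letter gets the same code in every vector, so aa and ab receive different codes even though their repeat pattern is the same.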