Home > Blockchain >  Group similar strings with numbers and keep order of first appearance
Group similar strings with numbers and keep order of first appearance

Time:09-06

I have a dataframe which looks like this example (just much larger):

var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)

      var  v1  v2
1   Peter 0.4 0.5
2     Ben 0.6 0.4
3    Mary 0.7 0.2
4 Peter.1 0.3 0.5
5   Ben.1 0.9 0.4
6  Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8   Ben.2 0.6 0.4
9  Mary.2 0.7 0.2

I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:

      var  v1  v2
1 Peter   0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4 # Ben appears second in the original data
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2 # Mary appears third in the original data
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

How can I achieve that?

Thank you!

CodePudding user response:

An option is to create a temporary column without the . and the digits (\\d ) at the end with str_remove, then use factor with levels specified as the unique values or use match to arrange the data

library(dplyr)
library(stringr)
df <- df %>%
   mutate(var1 = str_remove(var, "\\.\\d $")) %>% 
   arrange(factor(var1, levels = unique(var1))) %>%
   select(-var1)

Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance

library(forcats)
df %>% 
   arrange(fct_inorder(str_remove(var, "\\.\\d $")))

-output

     var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

CodePudding user response:

If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:

df %>%
  arrange(var)
      var  v1  v2
1     Ben 0.6 0.4
2   Ben.1 0.9 0.4
3   Ben.2 0.6 0.4
4    Mary 0.7 0.2
5  Mary.1 0.2 0.2
6  Mary.2 0.7 0.2
7   Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1

CodePudding user response:

Compact option with sub and data.table::chgroup:

df[chgroup(sub("\\..", "", df$var)),]

      var  v1  v2
1   Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2     Ben 0.6 0.4
5   Ben.1 0.9 0.4
8   Ben.2 0.6 0.4
3    Mary 0.7 0.2
6  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

CodePudding user response:

1) separate var into two columns, replace the NAs with 0, sort and remove the extra columns. This works even if the number of digits after the dot exceeds 1.

library(dplyr)
library(tidyr)

df %>%
  separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>% 
  mutate(no = replace_na(no, 0)) %>%
  arrange(alpha, no) %>%
  select(-alpha, -no)

giving:

      var  v1  v2
1     Ben 0.6 0.4
2   Ben.1 0.9 0.4
3   Ben.2 0.6 0.4
4    Mary 0.7 0.2
5  Mary.1 0.2 0.2
6  Mary.2 0.7 0.2
7   Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1

2) If you prefer to order it by the order of the first occurrence in each group then change the arrange line to the one shown below. The code is the same as above except for the arrange line.

df %>%
  separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>% 
  mutate(no = replace_na(no, 0)) %>%
  arrange(match(alpha, alpha), no) %>%
  select(-alpha, -no)

giving

      var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2
  • Related