Home > Back-end >  How to group similar strings with numbers in R?
How to group similar strings with numbers in R?

Time:09-06

I have a dataframe which looks like this example (just much larger):

var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)

      var  v1  v2
1   Peter 0.4 0.5
2     Ben 0.6 0.4
3    Mary 0.7 0.2
4 Peter.1 0.3 0.5
5   Ben.1 0.9 0.4
6  Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8   Ben.2 0.6 0.4
9  Mary.2 0.7 0.2

I want to group the strings in var, so that my output would like this:

      var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

How can I achieve that?

Thank you!

CodePudding user response:

An option is to create a temporary column without the . and the digits (\\d ) at the end with str_remove, then use factor with levels specified as the unique values or use match to arrange the data

library(dplyr)
library(stringr)
df <- df %>%
   mutate(var1 = str_remove(var, "\\.\\d $")) %>% 
   arrange(factor(var1, levels = unique(var1))) %>%
   select(-var1)

Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance

library(forcats)
df %>% 
   arrange(fct_inorder(str_remove(var, "\\.\\d $")))

-output

     var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

CodePudding user response:

separate var into two columns, replace the NAs with 0, sort and remove the extra columns.

library(dplyr)
library(tidyr)

df %>%
  separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>% 
  mutate(no = replace_na(no, 0)) %>%
  arrange(alpha, no) %>%
  select(-alpha, -no)

giving:

      var  v1  v2
1     Ben 0.6 0.4
2   Ben.1 0.9 0.4
3   Ben.2 0.6 0.4
4    Mary 0.7 0.2
5  Mary.1 0.2 0.2
6  Mary.2 0.7 0.2
7   Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1

CodePudding user response:

If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:

df %>%
  arrange(var)
      var  v1  v2
1     Ben 0.6 0.4
2   Ben.1 0.9 0.4
3   Ben.2 0.6 0.4
4    Mary 0.7 0.2
5  Mary.1 0.2 0.2
6  Mary.2 0.7 0.2
7   Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1
  • Related