R: How can I clean a dataframe column to remove unwanted characters and separate multiple values?-CodePudding

I want to clean a column of an R dataframe that contains a mixture of values:

chromosome_count
26
54
c.36
28-30
12, 24

so that it looks like this, with comma separated values split into two rows and keeping only the minimum values where a range is recorded:

chromosome_count
26
54
36
28
12
24

I'm a very stumped beginner and any advice would be very appreciated.

CodePudding user response：

require(tidyverse)

df <- tibble(
  chromo = c(26, 
             54, 
             "c.36", 
             "28-30", 
             "12, 24")
) 

df %>%
  separate_rows(chromo) %>%
  mutate(chromo = chromo %>%
           str_replace_all("[^0-9]", ""))

# A tibble: 7 x 1
  chromo
  <chr> 
1 26    
2 54    
3 36    
4 28    
5 30    
6 12    
7 24

CodePudding user response：

You could use regular expressions. ie remove from the string the -30 ie use a look behind, and if its a number, delete the end part of the range. This solution assumes the range is ordered min-max. Also delete anything from the start of a line that is not a digit

df %>%
  mutate(chromosome_count = str_remove(chromosome_count, "(?<=\d)-\\d |^\\D ")) %>%
  separate_rows(chromosome_count, convert = TRUE)

# A tibble: 6 x 1
  chromosome_count
             <int>
1               26
2               54
3               36
4               28
5               12
6               24