I want to clean a column of an R dataframe that contains a mixture of values:
chromosome_count |
---|
26 |
54 |
c.36 |
28-30 |
12, 24 |
so that it looks like this, with comma separated values split into two rows and keeping only the minimum values where a range is recorded:
chromosome_count |
---|
26 |
54 |
36 |
28 |
12 |
24 |
I'm a very stumped beginner and any advice would be very appreciated.
CodePudding user response:
require(tidyverse)
df <- tibble(
chromo = c(26,
54,
"c.36",
"28-30",
"12, 24")
)
df %>%
separate_rows(chromo) %>%
mutate(chromo = chromo %>%
str_replace_all("[^0-9]", ""))
# A tibble: 7 x 1
chromo
<chr>
1 26
2 54
3 36
4 28
5 30
6 12
7 24
CodePudding user response:
You could use regular expressions. ie remove from the string the -30
ie use a look behind, and if its a number, delete the end part of the range. This solution assumes the range is ordered min-max
. Also delete anything from the start of a line that is not a digit
df %>%
mutate(chromosome_count = str_remove(chromosome_count, "(?<=\d)-\\d |^\\D ")) %>%
separate_rows(chromosome_count, convert = TRUE)
# A tibble: 6 x 1
chromosome_count
<int>
1 26
2 54
3 36
4 28
5 12
6 24