I have a large data frame called data_frame
with two columns PRE
and STATUS
that look like this:
PRE STATUS
1_752566 GAINED
1_776546 LOST
1_832918 NA
1_842013 LOST
1_846864 GAINED
11_8122943 NA
11_8188699 GAINED
11_8321128 NA
23_95137734 NA
23_95146814 GAINED
What I would like is to create a new column CHR
with only the number(s) before the underscore and make sure they are matched up next to the original column correctly like this:
PRE STATUS CHR
1_752566 GAINED 1
1_776546 LOST 1
1_832918 NA 1
1_842013 LOST 1
1_846864 GAINED 1
11_8122943 NA 11
11_8188699 GAINED 11
11_8321128 NA 11
23_95137734 NA 23
23_95146814 GAINED 23
From here I'd like to group CHR
by number and then find the sum of each group. If possible, I would like a new data table showing the sums of each group number like this:
NUM SUM
1 5
11 3
23 2
I would then plot this to visualize the sums of each number where my x-axis is NUM
and my y-axis is SUM
CodePudding user response:
We could extract the prefix before the _
library(dplyr)
df <- df %>%
mutate(CHR = trimws(PRE, whitespace = "_.*"))
-output
df
PRE STATUS CHR
1 1_752566 GAINED 1
2 1_776546 LOST 1
3 1_832918 <NA> 1
4 1_842013 LOST 1
5 1_846864 GAINED 1
6 11_8122943 <NA> 11
7 11_8188699 GAINED 11
8 11_8321128 <NA> 11
9 23_95137734 <NA> 23
10 23_95146814 GAINED 23
Then, based on the expected output showed, it will be the count
we need
df %>%
count(CHR, name = "SUM")
CHR SUM
1 1 5
2 11 3
3 23 2
For the plot, do
library(ggplot2)
df %>%
count(CHR, name = "SUM") %>%
ggplot(aes(x = CHR, y = SUM))
geom_col()
data
df <- structure(list(PRE = c("1_752566", "1_776546", "1_832918", "1_842013",
"1_846864", "11_8122943", "11_8188699", "11_8321128", "23_95137734",
"23_95146814"), STATUS = c("GAINED", "LOST", NA, "LOST", "GAINED",
NA, "GAINED", NA, NA, "GAINED")), class = "data.frame", row.names = c(NA,
-10L))
CodePudding user response:
Here is an alternative approach using parse_number()
from readr
package:
parse_number
will extract only the first numbers. Here only the number before _
:
library(dplyr)
library(readr)
df %>%
group_by(CHR = parse_number(PRE)) %>%
summarise(NUM = first(CHR), SUM =n()) %>%
select(-CHR)
NUM SUM
<dbl> <int>
1 1 5
2 11 3
3 23 2