I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sort them from highest to lowest in a dataframe. Here's some reproductible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
DIAG_SECUN = c("O555", "O530", "O890"),
DIAGSEC1 = c("O766", "O876", "O899"),
DIAGSEC2 = c("O200", "I520", "O200"),
DIAGSEC3 = c("O233", "O200", "O620"),
DIAGSEC4 = c("O060", "O061", "O622"),
DIAGSEC5 = c("O540", "O123", "O344"),
DIAGSEC6 = c("O876", "Y321", "S333"),
DIAGSEC7 = c("O450", "X900", "O541"),
DIAGSEC8 = c("O222", "O111", "O123"),
DIAGSEC9 = c("O987", "O123", "O622"),
CID_MORTE = c("O066", "O699", "O555"),
CID_ASSO = c("O600", "O060", "O068"),
CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
"O01", "O010", "O011", "O019",
"O02", "O020", "O021", "O028", "O029",
"O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
"O038", "O039",
"O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047",
"O048", "O049",
"O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057",
"O058", "O059",
"O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067",
"O068", "O069",
"O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077",
"O078", "O079",
"O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087",
"O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column from DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, on my reproductible data ICD-10 cod "O066" appears twice.
Thank you in advance!
CodePudding user response:
We can unlist
the data into a vector
, use %in%
to subset the values from 'GRUPO1' and get the frequency count with table
in base R
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
CodePudding user response:
Here is a tidyverse
solution using tidyr
and dplyr
:
library(tidyverse)
pivot_longer(data, everything()) %>%
filter(value %in% GRUPO1) %>%
count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1