Home > Software engineering >  How to count the occurrence of a word in multiple variables in R and sort them from highest to lowes
How to count the occurrence of a word in multiple variables in R and sort them from highest to lowes

Time:06-14

I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.

I'm interested in counting how many times certain ICD-10 codes appear and then sort them from highest to lowest in a dataframe. Here's some reproductible data:

data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
                   DIAG_SECUN = c("O555", "O530", "O890"),
                   DIAGSEC1 = c("O766", "O876", "O899"),
                   DIAGSEC2 = c("O200", "I520", "O200"),
                   DIAGSEC3 = c("O233", "O200", "O620"),
                   DIAGSEC4 = c("O060", "O061", "O622"),
                   DIAGSEC5 = c("O540", "O123", "O344"),
                   DIAGSEC6 = c("O876", "Y321", "S333"),
                   DIAGSEC7 = c("O450", "X900", "O541"),
                   DIAGSEC8 = c("O222", "O111", "O123"),
                   DIAGSEC9 = c("O987", "O123", "O622"),
                   CID_MORTE = c("O066", "O699", "O555"),
                   CID_ASSO = c("O600", "O060", "O068"),
                   CID_NOTIF = c("O069", "O066", "O065"))

I also have a list of ICD-10 codes that I'm interested in counting.

GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
            "O01", "O010", "O011", "O019",
            "O02", "O020", "O021", "O028", "O029",
            "O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
            "O038", "O039",
            "O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047", 
            "O048", "O049",
            "O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057", 
            "O058", "O059",
            "O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067", 
            "O068", "O069",
            "O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077", 
            "O078", "O079",
            "O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087", 
            "O088", "O089")

What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column from DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, on my reproductible data ICD-10 cod "O066" appears twice.

Thank you in advance!

CodePudding user response:

We can unlist the data into a vector, use %in% to subset the values from 'GRUPO1' and get the frequency count with table in base R

v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]

O060 O066 O061 O065 O068 O069 
   2    2    1    1    1    1 

CodePudding user response:

Here is a tidyverse solution using tidyr and dplyr:

library(tidyverse) 

pivot_longer(data, everything()) %>% 
  filter(value %in% GRUPO1) %>% 
  count(value)

Output

  value     n
  <chr> <int>
1 O060      2
2 O061      1
3 O065      1
4 O066      2
5 O068      1
6 O069      1
  • Related