How to summarise long data into new wide format variables and keep improtant group information in R-CodePudding

Sorry if this has been answered, I did look though other posted quetions which helped me get to where I am but am now struggling, I am new to R/ computers and any help would be massively appreciated!

I have a dataset on which each row represents a subject. The data set is very "wide" with roughtly 200 variables (columns). I also have a "long" data set with laboratory testing results performed on samples collected from each subject. I have merged these data sets so that now each subject can appear multiple times i.e have multile rows depending on how many lab samples were recieved and tested.

I now want to try to analyse how many samples of each sample type were sent for each subject. Below is a simplified example which I hope helps explain.

#example data frame
sample_type <- c("blood", "blood", "sputum", "blood", "csf", "blood", "csf", "sputum", "sputum", "sputum", "sputum", "blood", "csf", "csf")
id <- c(1,1,1,2,3,4,4,5,5,5,6,6,7,7)
example_data <- data.frame(id, sample_type)

   id sample_type
1   1       blood
2   1       blood
3   1      sputum
4   2       blood
5   3         csf
6   4       blood
7   4         csf
8   5      sputum
9   5      sputum
10  5      sputum
11  6      sputum
12  6       blood
etc.

Here I have tried to create new variables. This works ok if I omit the "distict" function at the end, however when the distict function collapses the output it appears to select only one sample_type and then omits data in the other colums

example_data %>% 
  add_count(id, sample_type, name = "test_freq") %>% 
  mutate(blood_freq = case_when(sample_type == "blood" ~ test_freq),
         sputum_freq = case_when(sample_type == "sputum" ~ test_freq),
         csf_freq = case_when(sample_type == "csf" ~ test_freq)) %>% 
  distinct(id, .keep_all = 
             T)

Without the distict function, patient (id) no. 1 for example has 2 in the blood_freq column and a one in the sputum_freq column

   id sample_type test_freq blood_freq sputum_freq csf_freq
1   1       blood         2          2          NA       NA
2   1       blood         2          2          NA       NA
3   1      sputum         1         NA           1       NA
4   2       blood         1          1          NA       NA
5   3         csf         1         NA          NA        1
6   4       blood         1          1          NA       NA

with the distict function added, the blood_freq is corect but the sputum_freq is now NA

  id sample_type test_freq blood_freq sputum_freq csf_freq
1  1       blood         2          2          NA       NA
2  2       blood         1          1          NA       NA
3  3         csf         1         NA          NA        1
4  4       blood         1          1          NA       NA

Below is the closest I have come however the output is slightly odd as the values from "test_freq" column are not unique. I am also worried that I will have to define what I want doing with the rest of my data in the many other columns if I start pivoting longer? but perhaps this is not something I should be worried about?

example_data %>% 
  add_count(id, sample_type, name = "test_freq")  %>% 
  pivot_wider(names_from = sample_type, values_from = test_freq) %>% 
  distinct(id, .keep_all = TRUE)

     id blood     sputum    csf      
  <dbl> <list>    <list>    <list>   
1     1 <int [2]> <int [1]> <NULL>   
2     2 <int [1]> <NULL>    <NULL>   
3     3 <NULL>    <NULL>    <int [1]>
4     4 <int [1]> <NULL>    <int [1]>
5     5 <NULL>    <int [3]> <NULL>   
6     6 <int [1]> <int [1]> <NULL>   
7     7 <NULL>    <NULL>    <int [2]>

Ultimately I would like to say that x many subjects had y many sputum tests sent, and x many had y many blood and z many sputums etc.

I found that this code gives a good overview of tests sent but does not allow me to see how many patients had one blood and one sputum for example, so I am tryin got transfor the data as shown in these examples to better analyse it but just cant quite get there!

> example_data %>% group_by(id) %>% count(sample_type) %$% table(sample_type,n)
           n
sample_type 1 2 3
     blood  3 1 0
     csf    2 1 0
     sputum 2 0 1

I hope this makes sense? I am open to all suggestions or even any other ideas that achieves this in a better way.

Many thanks!!

CodePudding user response：

Perhaps something like this?

I bind one version of the data that counts each id-sample_type with another version of the data that just counts each id (this version called "test"). Then we can pivot the combined data set to provide the summary columns of your example.

library(tidyverse)
bind_rows(
  example_data %>% count(id, sample_type),
  example_data %>% count(id) %>%
    mutate(sample_type = "test")) %>%
  pivot_wider(names_from = sample_type, values_from = n, 
              names_glue = "{sample_type}_freq")

Result

# A tibble: 7 x 5
     id blood_freq sputum_freq csf_freq test_freq
  <dbl>      <int>       <int>    <int>     <int>
1     1          2           1       NA         3
2     2          1          NA       NA         1
3     3         NA          NA        1         1
4     4          1          NA        1         2
5     5         NA           3       NA         3
6     6          1           1       NA         2
7     7         NA          NA        2         2

To the latter part of your question, you write "Ultimately I would like to say that x many subjects had y many sputum tests sent, and x many had y many blood and z many sputums etc."

You could do something like:

example_data %>% 
  count(id, sample_type) %>%
  count(sample_type, n)

Result

Storing counts in `nn`, as `n` already present in input
ℹ Use `name = "new_name"` to pick a new name.
  sample_type n nn
1       blood 1  3
2       blood 2  1
3         csf 1  2
4         csf 2  1
5      sputum 1  2
6      sputum 3  1

This tells us that 3 id's had one blood test, 1 id had 2 blood tests, 2 id's had 1 csf test, etc.

Or if we are interested in combinations of blood and sputum tests:

example_data %>% 
  count(id, sample_type) %>%
  pivot_wider(names_from = sample_type, values_from = n) %>%
  count(blood, sputum)

Result

# A tibble: 5 x 3
  blood sputum     n
  <int>  <int> <int>
1     1      1     1
2     1     NA     2
3     2      1     1
4    NA      3     1
5    NA     NA     2

This tells us only one id had 2 blood and 1 sputum test, but 2 had 1 test and no sputum. (btw you could add values_fill = 0 to the pivot_wider if you want to replace NAs with 0s.)