How to sum up different strings in a grouped dataframe and spread them horizontally with incomplete-CodePudding

This question is related to:

In R: How to extract a specific (e.g. last) value from a dataframe with multiple rows belonging to one person?

Context:

I have a dataframe that contains variables from a collection of follow-up visits from patients after transplantation.

There are now 3 relevant variables: 1) the patient identification, 2) the unique transplantation identification, 3) the number of days between transplantation and follow-up

Each row is a follow-up visit. There can be multiple follow-ups for each patient in the data frame. The amount of follow-ups for each patient vary.

The transplantation identification is missing for some follow-ups. Sometimes patients have had two transplantations so the transplantations IDs differ. It is not clear what the maximum number of different transplantation IDs per patient are.

Questions:

First I would like to know, which patients have zero, one, two... etc. transplantation IDs.

After that I would like to extract the transplantation identification and put it into a separate column for all follow ups of a specific patient.

Example:

patient_ID <- c("A", "A", "A", "A", "B", "B", "C", "C", "C")
tx_ID <- c("123", NA, NA, "123", NA, NA, "456", NA, "789")
days_from_tx <- c(0, 5, 10, 15, 2, 4, 1, 2, 3)
df <- data.frame(patient_ID, tx_ID, days_from_tx)

  patient_ID tx_ID days_from_tx
1          A   123            0
2          A  <NA>            5
3          A  <NA>           10
4          A   123           15
5          B  <NA>            2
6          B  <NA>            4
7          C   456            3
8          C  <NA>           10
9          C   789            2

What I would like to have:

 patient_ID tx_ID days_from_tx nr_tx_ID  tx1  tx2
1          A   123            0        1  123 <NA>
2          A  <NA>            5        1  123 <NA>
3          A  <NA>           10        1  123 <NA>
4          A   123           15        1  123 <NA>
5          B  <NA>            2        0 <NA> <NA>
6          B  <NA>            4        0 <NA> <NA>
7          C   456            3        2  456  789
8          C  <NA>           10        2  456  789
9          C   789            2        2  456  789

What I have tried:

I tried to group the patients, then find out the unique transplantation IDs and write them in a new column. I used the code from the answer to my last question and modified it slightly.

df %>%
  group_by(patient_ID) %>%
  mutate(
    last_followup = n_distinct(tx_ID, na.rm = TRUE)
  )

Unfortunatley this did not work. R prints the number of unique transplantation IDs of all patients into the column.

CodePudding user response：

You could use mutate() to create the nr_tx_id column, and separately, using filter() and distinct() with pivot_wider()` to get the distincte tx_IDs in wide format, and then join the two results:

df %>% 
  group_by(patient_ID) %>% 
  mutate(nr_tx_ID = length(unique(tx_ID[!is.na(tx_ID)]))) %>% 
  ungroup()
  left_join(
    df %>% 
      filter(!is.na(tx_ID)) %>%
      distinct(patient_ID,tx_ID) %>% 
      group_by(patient_ID) %>% 
      mutate(id = row_number()) %>% 
      ungroup()
      pivot_wider(names_from = id, values_from = tx_ID,names_prefix = "tx")
  )

Output:

  patient_ID tx_ID days_from_tx nr_tx_ID tx1   tx2  
  <chr>      <chr>        <dbl>    <int> <chr> <chr>
1 A          123              0        1 123   NA   
2 A          NA               5        1 123   NA   
3 A          NA              10        1 123   NA   
4 A          123             15        1 123   NA   
5 B          NA               2        0 NA    NA   
6 B          NA               4        0 NA    NA   
7 C          456              1        2 456   789  
8 C          NA               2        2 456   789  
9 C          789              3        2 456   789

CodePudding user response：

Using by and unsplit.

by(df, df$patient_ID, \(x) {
  tx <- sort(unique(x$tx_ID))
  cbind(x, nr_tx_ID=length(tx), tx1=tx[1], tx2=tx[2])
}) |> unsplit(df$patient_ID)
#   patient_ID tx_ID days_from_tx nr_tx_ID  tx1  tx2
# 1          A   123            0        1  123 <NA>
# 2          A  <NA>            5        1  123 <NA>
# 3          A  <NA>           10        1  123 <NA>
# 4          A   123           15        1  123 <NA>
# 5          B  <NA>            2        0 <NA> <NA>
# 6          B  <NA>            4        0 <NA> <NA>
# 7          C   456            1        2  456  789
# 8          C  <NA>            2        2  456  789
# 9          C   789            3        2  456  789

Data:

df <- structure(list(patient_ID = c("A", "A", "A", "A", "B", "B", "C", 
"C", "C"), tx_ID = c("123", NA, NA, "123", NA, NA, "456", NA, 
"789"), days_from_tx = c(0, 5, 10, 15, 2, 4, 1, 2, 3)), class = "data.frame", row.names = c(NA, 
-9L))