Home > Net >  Is there an elegant way to replace NAs with values from a corresponding column, for multiple columns
Is there an elegant way to replace NAs with values from a corresponding column, for multiple columns

Time:03-21

I'm working with a dataframe of trial participant blood test results, with some sporadic missing values (analyte failed). Fortunately we have two time points quite close together, so for missing values at timepoint 1, i'm hoping to impute the corresponding value from timepoint 2. I am just wondering, if there is an elegant way to code this in R/tidyverse for multiple test results?

Here is some sample data:

           timepoint = c(1,1,1,1,1,2,2,2,2,2),
           fst_test = c(NA,sample(1:40,9, replace =F)),
           scd_test = c(sample(1:20,8, replace = F),NA,NA))

So far I have been pivoting wider, then manually coalescing the corresponding test results, like so:

test %>% 
  pivot_wider(names_from = timepoint, 
              values_from = fst_test:scd_test) %>%
  mutate(fst_test_imputed = coalesce(fst_test_1, fst_test_2),
         scd_test_imputed = coalesce(scd_test_1, scd_test_2)) %>% 
  select(ID, fst_test_imputed, scd_test_imputed)

However for 15 tests this is cumbersome... I thought there might be an elegant R / dplyr solution for this situation?

Many thanks in advance for your help!!

CodePudding user response:

We could use fill after creating a grouping column with rowid on the 'timepoint' (as the OP mentioned to replace with corresponding data point in 'timepoint' column). Then, we just need fill and specify the .direction as "updown" to fill NA in the preceding value with the succeeding non-NA first (if it should be only to take care of 'NA' in 'timepoint' 1, then change the .direction = "up")

library(dplyr)
library(tidyr)
library(data.table)
test %>%
    group_by(grp = rowid(timepoint)) %>%
    fill(fst_test, scd_test, .direction = "updown") %>%
    ungroup %>% 
    select(-grp)

data

test <- structure(list(timepoint = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 
fst_test = c(NA, 
16L, 30L, 29L, 14L, 32L, 21L, 20L, 3L, 23L), scd_test = c(18L, 
17L, 8L, 20L, 1L, 10L, 14L, 19L, NA, NA)),
 class = "data.frame", row.names = c(NA, 
-10L))

CodePudding user response:

You could pivot your data so that "timepoint" defines the columns, with all your tests on the rows. In order to perform this pivot without creating list-cols, we'll have to group by "timepoint" and create an index for each row within the group:

test <- tibble(
  timepoint = c(1,1,1,1,1,2,2,2,2,2),
  fst_test = c(NA,sample(1:40,9, replace =F)),
  scd_test = c(sample(1:20,8, replace = F),NA,NA))
)

test_pivoted <- test %>% 
  group_by(timepoint) %>% 
  mutate(idx = row_number()) %>% 
  pivot_longer(-c(timepoint, idx)) %>% 
  pivot_wider(names_from = timepoint, values_from = value, names_prefix = 'timepoint')

     idx name     timepoint1 timepoint2
   <int> <chr>         <int>      <int>
 1     1 fst_test         NA         39
 2     1 scd_test          5         10
 3     2 fst_test         37          7
 4     2 scd_test         20          3
 5     3 fst_test          5         26
 6     3 scd_test         19         11
 7     4 fst_test         17         28
 8     4 scd_test          9         NA
 9     5 fst_test         14         32
10     5 scd_test          8         NA

Now we can coalesce once across the two timepoints for all tests:

test_pivoted %>%   
  mutate(
    imputed = coalesce(timepoint1, timepoint2)
  )

     idx name     timepoint1 timepoint2 imputed
   <int> <chr>         <int>      <int>   <int>
 1     1 fst_test         NA         39      39
 2     1 scd_test          5         10       5
 3     2 fst_test         37          7      37
 4     2 scd_test         20          3      20
 5     3 fst_test          5         26       5
 6     3 scd_test         19         11      19
 7     4 fst_test         17         28      17
 8     4 scd_test          9         NA       9
 9     5 fst_test         14         32      14
10     5 scd_test          8         NA       8

And if you wanted to clean up the result a little more:

test_pivoted %>%   
  mutate(
    imputed = coalesce(timepoint1, timepoint2)
  ) %>% 
  select(name, idx, imputed) %>% 
  pivot_wider(names_from = name, values_from = imputed)

    idx fst_test scd_test
  <int>    <int>    <int>
1     1       39        5
2     2       37       20
3     3        5       19
4     4       17        9
5     5       14        8
  • Related