Home > OS >  Make "tidy" tibble dataset with matrix variable
Make "tidy" tibble dataset with matrix variable

Time:08-10

I have a dataset in a one-row-per-person format, but I want to merge in a dataset as a matrix variable -- basically, if not for the need of the tidy format, I would use a fuzzy join to merge in data from a range of dates, creating n x 3 rows. Instead, I really want to merge them in as a matrix variable.

I've tried using the fuzzy join followed by a nest function, but my merge dataset contains several variables, so needing to explicitly state my variables for the nest function would be a ton of work and ultimately look messy (though, I suppose if this is my only way forward, I could commit to this).

By way of providing a reproducible example of what I am thinking, I've created the fake dataset below:

person_data <- tibble(ID=paste0("ID", 1:50), outcome=sample(0:1, 50, T), date=sample(5:145, 50, F),min_date=date-5)
value_data <- tibble(date=seq(1:150), exp1=sample(20:100, 150, T), exp2=sample(20:100, 150, T))

My first thought was to combine fuzzy join and the nest function like so:

merged <- fuzzyjoin::fuzzy_left_join(person_data, value_data, by=c("date" = "date", "min_date" = "date"), match_fun=c(`>=`,`<=`))

analysis_data <- merged %>% 
  select(-date.y) %>% 
  group_by(ID) %>% 
  nest(exp1=exp1, exp2=exp2)

The head of the resulting analysis_data tibble is here.

> analysis_data
# A tibble: 50 × 6
# Groups:   ID [50]
   ID    outcome date.x min_date exp1             exp2            
   <chr>   <int>  <int>    <dbl> <list>           <list>          
 1 ID1         1     39       34 <tibble [6 × 1]> <tibble [6 × 1]>
 2 ID2         1     87       82 <tibble [6 × 1]> <tibble [6 × 1]>
 3 ID3         0     23       18 <tibble [6 × 1]> <tibble [6 × 1]>
 4 ID4         1     60       55 <tibble [6 × 1]> <tibble [6 × 1]>
 5 ID5         1    120      115 <tibble [6 × 1]> <tibble [6 × 1]>
 6 ID6         0     35       30 <tibble [6 × 1]> <tibble [6 × 1]>
 7 ID7         0    131      126 <tibble [6 × 1]> <tibble [6 × 1]>
 8 ID8         0     83       78 <tibble [6 × 1]> <tibble [6 × 1]>
 9 ID9         0     68       63 <tibble [6 × 1]> <tibble [6 × 1]>
10 ID10        1    133      128 <tibble [6 × 1]> <tibble [6 × 1]>

And for the most part, it kind of does get me there because the resulting data has 50 rows but is there a better/simpler way of approaching this problem, especially if I had exp1-50 instead of exp1 and exp2? I'd also like to add an index matrix variable, but I'm unsure how to add that.

Realizing it doesn't fit with the rest of my examples, I basically want a dataset that looks like this in the end:

fake_data_goal <- tibble(ID=paste0("ID", 1:50), outcome=sample(0:1, 50, T), var=matrix(data=sample(50:100, 50*3, T), nrow=50), index=matrix(rep(c(1:3), 50), nrow=50, byrow=T))

> fake_data_goal
# A tibble: 50 × 4
   ID    outcome var[,1]  [,2]  [,3] index[,1]  [,2]  [,3]
   <chr>   <int>   <int> <int> <int>     <int> <int> <int>
 1 ID1         0      66    90    71         1     2     3
 2 ID2         0      95    98    75         1     2     3
 3 ID3         0      57    84    91         1     2     3
 4 ID4         1      78    89    64         1     2     3
 5 ID5         1      97    73    95         1     2     3
 6 ID6         0      52    96    76         1     2     3
 7 ID7         0      62    93    57         1     2     3
 8 ID8         0      62    76    81         1     2     3
 9 ID9         1      55    58    67         1     2     3
10 ID10        0      81    91    91         1     2     3

Thank you so much in advance!

CodePudding user response:

This is along the same lines as Andy Baxter's answer, but instead of a having a matrix and an index for each exp this combines the two pieces of information to make each column name i.e. instead of exp[,1] [,2] [,3] index[,1] [,2] [,3] you have exp1_1 exp1_2 exp1_3 exp2_1 exp2_2 exp2_3. Would this suit your use-case?

library(tidyverse)
set.seed(123)

person_data <- tibble(ID=paste0("ID", 1:50), outcome=sample(0:1, 50, T), date=sample(5:145, 50, F),min_date=date-5)
value_data <- tibble(date=seq(1:150), exp1=sample(20:100, 150, T), exp2=sample(20:100, 150, T))

merged <- fuzzyjoin::fuzzy_left_join(person_data, value_data, by=c("date" = "date", "min_date" = "date"), match_fun=c(`>=`,`<=`))

analysis_data <- merged %>% 
  select(-date.y) %>% 
  pivot_longer(starts_with("exp")) %>%
  group_by(ID, name) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = c(name, n),
              values_from = value,
              names_sort = TRUE)
analysis_data
#> # A tibble: 50 × 16
#> # Groups:   ID [50]
#>    ID    outcome date.x min_d…¹ exp1_1 exp1_2 exp1_3 exp1_4 exp1_5 exp1_6 exp2_1
#>    <chr>   <int>  <int>   <dbl>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1 ID1         0     57      52     39     86     55     71     41     68     59
#>  2 ID2         0    139     134     44     27     74     64     37     61     98
#>  3 ID3         0    145     140     50     25     90     80     67     36     79
#>  4 ID4         1     38      33     74     94     55     58     73     28     55
#>  5 ID5         0     73      68     91     63     51     55     64     33     32
#>  6 ID6         1     76      71     55     64     33     35     52     59     41
#>  7 ID7         1     80      75     52     59     59     29     91     28     55
#>  8 ID8         1     67      62     27     65     85     96     65     89    100
#>  9 ID9         0    101      96     73     96     30     44     71     45     67
#> 10 ID10        0     95      90     45     52     76     48     29     72     99
#> # … with 40 more rows, 5 more variables: exp2_2 <int>, exp2_3 <int>,
#> #   exp2_4 <int>, exp2_5 <int>, exp2_6 <int>, and abbreviated variable name
#> #   ¹​min_date
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Created on 2022-08-10 by the reprex package (v2.0.1)

CodePudding user response:

As perhaps half of an answer which would at least mean you don't have to repeat the typing of exp, exp2... up to exp50 etc., you could pivot_longer before joining and pivot_wider after nesting to give the right number of list columns. These can then be plucked out into matrix columns by pattern matching column names:

library(tidyverse)

person_data <-
  tibble(
    ID = paste0("ID", 1:50),
    outcome = sample(0:1, 50, T),
    date = sample(5:145, 50, F),
    min_date = date - 5
  )

value_data <-
  tibble(
    date = seq(1:150),
    exp1 = sample(20:100, 150, T),
    exp2 = sample(20:100, 150, T)
  )


merged <-
  fuzzyjoin::fuzzy_left_join(
    person_data,
    value_data |> pivot_longer(-date, names_to = "exp", values_to = "val"),
    by = c("date" = "date", "min_date" = "date"),
    match_fun = c(`>=`, `<=`)
  )

merged |> 
  select(-date.y) |> 
  group_by(ID, exp) |> 
  nest(val = val) |> 
  pivot_wider(names_from = exp,
              values_from = val) |> 
  rowwise() |>
  mutate(across(starts_with("exp"), ~ t(as.matrix(.x$val))))
#> # A tibble: 50 × 6
#> # Rowwise:  ID
#>    ID    outcome date.x min_date exp1[,1]  [,2]  [,3]  [,4]  [,5]  [,6] exp2[,1]
#>    <chr>   <int>  <int>    <dbl>    <int> <int> <int> <int> <int> <int>    <int>
#>  1 ID1         0    118      113       96    83    69    59    42    34       60
#>  2 ID2         1      9        4       38    91    83    28    87    73       96
#>  3 ID3         1     83       78       49    31    33    84    57    50       83
#>  4 ID4         1     97       92       91    92    21    53    59    39       23
#>  5 ID5         1     71       66       97    28    56    91    67    43       98
#>  6 ID6         1     27       22       81    88    41    22    24    84       36
#>  7 ID7         1     64       59       46    51    88    76    39    63       53
#>  8 ID8         0     72       67       28    56    91    67    43    55       23
#>  9 ID9         0     80       75       76    79    62    49    31    33       66
#> 10 ID10        0     87       82       57    50    31    72    95    31       66
#> # … with 40 more rows, and 1 more variable: exp2[2:6] <int>

Created on 2022-08-10 by the reprex package (v2.0.1)

As noted, it's not fully answering your question, but would at least get you to the half-way-acceptable stage you'd reached when producing analysis_data.

The things remaining would be:

  1. is this where you expect your exp1 etc. variables to go?
  2. it's not clear where your index values would come from for each row?
  • Related