How can I match-up rows in R data frame-CodePudding

I have a data frame that looks something like this:

participant	Sex	Age	interval	reproduction	condition
22014	Female	18	NA	NA	NA
22014	Female	18	1.536131	NA	NA
22014	Female	18	NA	NA	NA
22014	Female	18	1.416826	NA	NA
22014	Female	18	NA	NA	NA
22014	Female	18	1.549845	NA	NA
22014	Female	18	NA	NA	NA
22014	Female	18	1.542681	NA	NA
22014	Female	18	NA	NA	NA
22014	Female	18	1.265929	NA	NA
22014	Female	18	NA	1.2531	NA
22014	Female	18	NA	1.2507	NA
22014	Female	18	NA	1.7841	NA
22014	Female	18	NA	1.3536	NA
22014	Female	18	NA	0.8031	NA
22014	Female	18	NA	NA	Non-Causal

...etc...

I need to do 3 things:

i) 'backfill' the values in 'condition' upwards so that every cell in 'condition' upwards from a valid entry (here Non-Causal) is filled with that valid entry.

ii) match the 5 entries in 'reproduction' with the 5 entries in 'interval' in corresponding order, i.e. so that 1.2531 is moved up to be next to 1.536131, and 1.2507 with 1.416826 etc

iii) get rid of the NA rows so that in the end there are only 5 rows left, with valid entries in each of the columns

Any hints on how to tackle this? The actual dataframe is much longer, and 'condition' takes on different values; there will always be 5 entries, though ,per condition, and they should have matched interval and reproduction entries

CodePudding user response：

You can group and summarize:

library(dplyr)
dat %>%
  group_by(participant, Sex, Age) %>%
  summarize(across(c(interval, reproduction, condition), ~ .[!is.na(.)])) %>%
  ungroup()
# # A tibble: 5 x 6
#   participant Sex      Age interval reproduction condition 
#         <int> <chr>  <int>    <dbl>        <dbl> <chr>     
# 1       22014 Female    18     1.54        1.25  Non-Causal
# 2       22014 Female    18     1.42        1.25  Non-Causal
# 3       22014 Female    18     1.55        1.78  Non-Causal
# 4       22014 Female    18     1.54        1.35  Non-Causal
# 5       22014 Female    18     1.27        0.803 Non-Causal

(This will glitch if the number of non-NA in condition is other than 1, or if the number of non-NA in the other columns is not the same.)

CodePudding user response：

You can so most of the work with dplyr and tidyr. For example if your data is in a data.frame named dd,

library(dplyr)
library(tidyr)
dd %>% 
  group_by(participant, Sex, Age) %>% 
  fill(condition, .direction="up") %>% 
  summarize(across(everything(), ~head(na.omit(.x), 5)))

We use tidyr::fill to back fill the condition, then use use dplyr::summarize() to keep only the first 5 non-NA for all the columns that are not use for grouping the rows.

CodePudding user response：

Here is a base R solution, except for the function na.locf, from package zoo.

df1$condition <- with(df1, ave(condition, participant, FUN = \(x) zoo::na.locf(x, fromLast =TRUE)))
i <- with(df1, ave(interval, participant, FUN = \(x) !is.na(x)))
j <- with(df1, ave(reproduction, participant, FUN = \(x) !is.na(x)))
df1$reproduction[as.logical(i)] <- df1$reproduction[as.logical(j)]
df1$reproduction[as.logical(j)] <- NA_real_
df1 <- df1[complete.cases(df1), ]

df1
#   participant    Sex Age interval reproduction  condition
#2        22014 Female  18 1.536131       1.2531 Non-Causal
#4        22014 Female  18 1.416826       1.2507 Non-Causal
#6        22014 Female  18 1.549845       1.7841 Non-Causal
#8        22014 Female  18 1.542681       1.3536 Non-Causal
#10       22014 Female  18 1.265929       0.8031 Non-Causal

Data

df1 <- read.table(text = "
participant     Sex     Age     interval    reproduction    condition
22014   Female  18  NA  NA  NA
22014   Female  18  1.536131    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.416826    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.549845    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.542681    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.265929    NA  NA
22014   Female  18  NA  1.2531  NA
22014   Female  18  NA  1.2507  NA
22014   Female  18  NA  1.7841  NA
22014   Female  18  NA  1.3536  NA
22014   Female  18  NA  0.8031  NA
22014   Female  18  NA  NA  Non-Causal
", header = TRUE)

CodePudding user response：

This is the long way of what r2evans and Mr.Flick represent:

library(dplyr)
library(tidyr)
df %>% 
  fill(condition, .direction = "up") %>% 
  mutate(id = row_number()) %>% 
  pivot_longer(
    cols = c(interval, reproduction)
  ) %>% 
  na.omit() %>% 
  pivot_wider(
    names_from = name,
    values_from = value
  ) %>% 
  mutate(reproduction = lead(reproduction,5)) %>% 
  na.omit() %>% 
  select(-id) %>% 
  relocate(condition, .after = 6)