Home > Enterprise >  How can I match-up rows in R data frame
How can I match-up rows in R data frame

Time:11-03

I have a data frame that looks something like this:

participant Sex Age interval reproduction condition
22014 Female 18 NA NA NA
22014 Female 18 1.536131 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.416826 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.549845 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.542681 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.265929 NA NA
22014 Female 18 NA 1.2531 NA
22014 Female 18 NA 1.2507 NA
22014 Female 18 NA 1.7841 NA
22014 Female 18 NA 1.3536 NA
22014 Female 18 NA 0.8031 NA
22014 Female 18 NA NA Non-Causal

...etc...

I need to do 3 things:

i) 'backfill' the values in 'condition' upwards so that every cell in 'condition' upwards from a valid entry (here Non-Causal) is filled with that valid entry.

ii) match the 5 entries in 'reproduction' with the 5 entries in 'interval' in corresponding order, i.e. so that 1.2531 is moved up to be next to 1.536131, and 1.2507 with 1.416826 etc

iii) get rid of the NA rows so that in the end there are only 5 rows left, with valid entries in each of the columns

Any hints on how to tackle this? The actual dataframe is much longer, and 'condition' takes on different values; there will always be 5 entries, though ,per condition, and they should have matched interval and reproduction entries

CodePudding user response:

You can group and summarize:

library(dplyr)
dat %>%
  group_by(participant, Sex, Age) %>%
  summarize(across(c(interval, reproduction, condition), ~ .[!is.na(.)])) %>%
  ungroup()
# # A tibble: 5 x 6
#   participant Sex      Age interval reproduction condition 
#         <int> <chr>  <int>    <dbl>        <dbl> <chr>     
# 1       22014 Female    18     1.54        1.25  Non-Causal
# 2       22014 Female    18     1.42        1.25  Non-Causal
# 3       22014 Female    18     1.55        1.78  Non-Causal
# 4       22014 Female    18     1.54        1.35  Non-Causal
# 5       22014 Female    18     1.27        0.803 Non-Causal

(This will glitch if the number of non-NA in condition is other than 1, or if the number of non-NA in the other columns is not the same.)

CodePudding user response:

You can so most of the work with dplyr and tidyr. For example if your data is in a data.frame named dd,

library(dplyr)
library(tidyr)
dd %>% 
  group_by(participant, Sex, Age) %>% 
  fill(condition, .direction="up") %>% 
  summarize(across(everything(), ~head(na.omit(.x), 5)))

We use tidyr::fill to back fill the condition, then use use dplyr::summarize() to keep only the first 5 non-NA for all the columns that are not use for grouping the rows.

CodePudding user response:

Here is a base R solution, except for the function na.locf, from package zoo.

df1$condition <- with(df1, ave(condition, participant, FUN = \(x) zoo::na.locf(x, fromLast =TRUE)))
i <- with(df1, ave(interval, participant, FUN = \(x) !is.na(x)))
j <- with(df1, ave(reproduction, participant, FUN = \(x) !is.na(x)))
df1$reproduction[as.logical(i)] <- df1$reproduction[as.logical(j)]
df1$reproduction[as.logical(j)] <- NA_real_
df1 <- df1[complete.cases(df1), ]

df1
#   participant    Sex Age interval reproduction  condition
#2        22014 Female  18 1.536131       1.2531 Non-Causal
#4        22014 Female  18 1.416826       1.2507 Non-Causal
#6        22014 Female  18 1.549845       1.7841 Non-Causal
#8        22014 Female  18 1.542681       1.3536 Non-Causal
#10       22014 Female  18 1.265929       0.8031 Non-Causal

Data

df1 <- read.table(text = "
participant     Sex     Age     interval    reproduction    condition
22014   Female  18  NA  NA  NA
22014   Female  18  1.536131    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.416826    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.549845    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.542681    NA  NA
22014   Female  18  NA  NA  NA
22014   Female  18  1.265929    NA  NA
22014   Female  18  NA  1.2531  NA
22014   Female  18  NA  1.2507  NA
22014   Female  18  NA  1.7841  NA
22014   Female  18  NA  1.3536  NA
22014   Female  18  NA  0.8031  NA
22014   Female  18  NA  NA  Non-Causal
", header = TRUE)

CodePudding user response:

This is the long way of what r2evans and Mr.Flick represent:

library(dplyr)
library(tidyr)
df %>% 
  fill(condition, .direction = "up") %>% 
  mutate(id = row_number()) %>% 
  pivot_longer(
    cols = c(interval, reproduction)
  ) %>% 
  na.omit() %>% 
  pivot_wider(
    names_from = name,
    values_from = value
  ) %>% 
  mutate(reproduction = lead(reproduction,5)) %>% 
  na.omit() %>% 
  select(-id) %>% 
  relocate(condition, .after = 6)
  • Related