I have a data frame that looks something like this:
participant | Sex | Age | interval | reproduction | condition |
---|---|---|---|---|---|
22014 | Female | 18 | NA | NA | NA |
22014 | Female | 18 | 1.536131 | NA | NA |
22014 | Female | 18 | NA | NA | NA |
22014 | Female | 18 | 1.416826 | NA | NA |
22014 | Female | 18 | NA | NA | NA |
22014 | Female | 18 | 1.549845 | NA | NA |
22014 | Female | 18 | NA | NA | NA |
22014 | Female | 18 | 1.542681 | NA | NA |
22014 | Female | 18 | NA | NA | NA |
22014 | Female | 18 | 1.265929 | NA | NA |
22014 | Female | 18 | NA | 1.2531 | NA |
22014 | Female | 18 | NA | 1.2507 | NA |
22014 | Female | 18 | NA | 1.7841 | NA |
22014 | Female | 18 | NA | 1.3536 | NA |
22014 | Female | 18 | NA | 0.8031 | NA |
22014 | Female | 18 | NA | NA | Non-Causal |
...etc...
I need to do 3 things:
i) 'backfill' the values in 'condition' upwards so that every cell in 'condition' upwards from a valid entry (here Non-Causal) is filled with that valid entry.
ii) match the 5 entries in 'reproduction' with the 5 entries in 'interval' in corresponding order, i.e. so that 1.2531 is moved up to be next to 1.536131, and 1.2507 with 1.416826 etc
iii) get rid of the NA rows so that in the end there are only 5 rows left, with valid entries in each of the columns
Any hints on how to tackle this? The actual dataframe is much longer, and 'condition' takes on different values; there will always be 5 entries, though ,per condition, and they should have matched interval and reproduction entries
CodePudding user response:
You can group and summarize:
library(dplyr)
dat %>%
group_by(participant, Sex, Age) %>%
summarize(across(c(interval, reproduction, condition), ~ .[!is.na(.)])) %>%
ungroup()
# # A tibble: 5 x 6
# participant Sex Age interval reproduction condition
# <int> <chr> <int> <dbl> <dbl> <chr>
# 1 22014 Female 18 1.54 1.25 Non-Causal
# 2 22014 Female 18 1.42 1.25 Non-Causal
# 3 22014 Female 18 1.55 1.78 Non-Causal
# 4 22014 Female 18 1.54 1.35 Non-Causal
# 5 22014 Female 18 1.27 0.803 Non-Causal
(This will glitch if the number of non-NA
in condition
is other than 1
, or if the number of non-NA
in the other columns is not the same.)
CodePudding user response:
You can so most of the work with dplyr
and tidyr
. For example if your data is in a data.frame named dd
,
library(dplyr)
library(tidyr)
dd %>%
group_by(participant, Sex, Age) %>%
fill(condition, .direction="up") %>%
summarize(across(everything(), ~head(na.omit(.x), 5)))
We use tidyr::fill
to back fill the condition, then use use dplyr::summarize()
to keep only the first 5 non-NA for all the columns that are not use for grouping the rows.
CodePudding user response:
Here is a base R solution, except for the function na.locf
, from package zoo
.
df1$condition <- with(df1, ave(condition, participant, FUN = \(x) zoo::na.locf(x, fromLast =TRUE)))
i <- with(df1, ave(interval, participant, FUN = \(x) !is.na(x)))
j <- with(df1, ave(reproduction, participant, FUN = \(x) !is.na(x)))
df1$reproduction[as.logical(i)] <- df1$reproduction[as.logical(j)]
df1$reproduction[as.logical(j)] <- NA_real_
df1 <- df1[complete.cases(df1), ]
df1
# participant Sex Age interval reproduction condition
#2 22014 Female 18 1.536131 1.2531 Non-Causal
#4 22014 Female 18 1.416826 1.2507 Non-Causal
#6 22014 Female 18 1.549845 1.7841 Non-Causal
#8 22014 Female 18 1.542681 1.3536 Non-Causal
#10 22014 Female 18 1.265929 0.8031 Non-Causal
Data
df1 <- read.table(text = "
participant Sex Age interval reproduction condition
22014 Female 18 NA NA NA
22014 Female 18 1.536131 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.416826 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.549845 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.542681 NA NA
22014 Female 18 NA NA NA
22014 Female 18 1.265929 NA NA
22014 Female 18 NA 1.2531 NA
22014 Female 18 NA 1.2507 NA
22014 Female 18 NA 1.7841 NA
22014 Female 18 NA 1.3536 NA
22014 Female 18 NA 0.8031 NA
22014 Female 18 NA NA Non-Causal
", header = TRUE)
CodePudding user response:
This is the long way of what r2evans and Mr.Flick represent:
library(dplyr)
library(tidyr)
df %>%
fill(condition, .direction = "up") %>%
mutate(id = row_number()) %>%
pivot_longer(
cols = c(interval, reproduction)
) %>%
na.omit() %>%
pivot_wider(
names_from = name,
values_from = value
) %>%
mutate(reproduction = lead(reproduction,5)) %>%
na.omit() %>%
select(-id) %>%
relocate(condition, .after = 6)