I've been given an Excel file in which the end of each group of data is marked by a row that is blank except for one cell which contains a string like "Person 1", "Person 2", "Person 3", and so on. The data belonging to Person 1 are in rows preceding the row containing "Person 1", the data belonging to Person 2 are in the rows between the row with "Person 1" and the row containing "Person 2". This pattern is followed until the end of the file, where the last row contains a cell with "Person 100". To make things even more interesting, the "Person [n]" string is not always in the same column and the number of rows per person can vary. See the toy example below.
df_1 <- data.frame(iv1=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
iv2=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
iv3=c(rbinom(3,1,.4), "Person 1", rbinom(4,1,.4), NA, rbinom(2,1,.4), "Person 3"),
dv1=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), "Person 2", rbinom(2,1,.4), NA),
dv2=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
dv3=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA))
Yields this data frame
iv1 iv2 iv3 dv1 dv2 dv3
1 1 1 0 1 1 0
2 0 0 1 0 0 0
3 1 0 0 1 0 1
4 NA NA Person 1 <NA> NA NA
5 1 1 0 0 1 1
6 1 0 0 0 0 0
7 0 0 0 1 0 0
8 1 0 0 1 1 1
9 NA NA <NA> Person 2 NA NA
10 0 0 0 1 0 0
11 0 1 0 0 0 1
12 NA NA Person 3 <NA> NA NA
What I would like to do is create a new column ("Person_ID") that identifies the data belonging to each person, so Person_ID would equal 1 for rows belonging to Person 1, Person_ID would equal 2 for rows belonging to Person 2, and so on, as in the data frame below.
iv1 iv2 iv3 dv1 dv2 dv3 Person_ID
1 1 1 0 1 1 0 1
2 0 0 1 0 0 0 1
3 1 0 0 1 0 1 1
4 1 1 0 0 1 1 2
5 1 0 0 0 0 0 2
6 0 0 0 1 0 0 2
7 1 0 0 1 1 1 2
8 0 0 0 1 0 0 3
9 0 1 0 0 0 1 3
I would love a dplyr-based solution, but of course, I'm open to whatever works. Thanks!
CodePudding user response:
We could do it this way:
The values in iv1:dv3
do not match because you did not set a seed:
First solution is depending on NA
s that may interfere with other NA
data.
The second solution is independent of NA
s:
library(dplyr)
df_1 %>%
mutate(Person_ID=cumsum(is.na(iv1)) 1) %>%
na.omit()
iv1 iv2 iv3 dv1 dv2 dv3 Person_ID
<int> <int> <chr> <chr> <int> <int> <dbl>
1 0 0 0 0 0 0 1
2 1 1 1 0 1 0 1
3 1 0 0 0 0 0 1
4 1 1 0 0 0 1 2
5 1 1 0 0 0 1 2
6 1 0 0 0 1 1 2
7 0 0 1 1 1 0 2
8 0 0 1 0 0 0 3
9 1 1 0 1 0 0 3
Another way could be:
library(tidyverse)
df_1 %>%
mutate(Person_ID = coalesce(iv3, dv1),
Person_ID = ifelse(str_detect(Person_ID, "Person"), parse_number(Person_ID), NA)) %>%
fill(Person_ID, .direction = "up") %>%
na.omit()