Home > OS >  Assign group identifiers to groups of rows falling between rows containing a string in R
Assign group identifiers to groups of rows falling between rows containing a string in R

Time:03-18

I've been given an Excel file in which the end of each group of data is marked by a row that is blank except for one cell which contains a string like "Person 1", "Person 2", "Person 3", and so on. The data belonging to Person 1 are in rows preceding the row containing "Person 1", the data belonging to Person 2 are in the rows between the row with "Person 1" and the row containing "Person 2". This pattern is followed until the end of the file, where the last row contains a cell with "Person 100". To make things even more interesting, the "Person [n]" string is not always in the same column and the number of rows per person can vary. See the toy example below.

 df_1 <- data.frame(iv1=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
            iv2=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
            iv3=c(rbinom(3,1,.4), "Person 1", rbinom(4,1,.4), NA, rbinom(2,1,.4), "Person 3"),
            dv1=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), "Person 2", rbinom(2,1,.4), NA),
            dv2=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
            dv3=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA))

Yields this data frame

   iv1 iv2      iv3      dv1 dv2 dv3
1    1   1        0        1   1   0
2    0   0        1        0   0   0
3    1   0        0        1   0   1
4   NA  NA Person 1     <NA>  NA  NA
5    1   1        0        0   1   1
6    1   0        0        0   0   0
7    0   0        0        1   0   0
8    1   0        0        1   1   1
9   NA  NA     <NA> Person 2  NA  NA
10   0   0        0        1   0   0
11   0   1        0        0   0   1
12  NA  NA Person 3     <NA>  NA  NA

What I would like to do is create a new column ("Person_ID") that identifies the data belonging to each person, so Person_ID would equal 1 for rows belonging to Person 1, Person_ID would equal 2 for rows belonging to Person 2, and so on, as in the data frame below.

  iv1 iv2 iv3 dv1 dv2 dv3 Person_ID
1   1   1   0   1   1   0         1
2   0   0   1   0   0   0         1
3   1   0   0   1   0   1         1
4   1   1   0   0   1   1         2
5   1   0   0   0   0   0         2
6   0   0   0   1   0   0         2
7   1   0   0   1   1   1         2
8   0   0   0   1   0   0         3
9   0   1   0   0   0   1         3

I would love a dplyr-based solution, but of course, I'm open to whatever works. Thanks!

CodePudding user response:

We could do it this way: The values in iv1:dv3 do not match because you did not set a seed:

First solution is depending on NAs that may interfere with other NA data. The second solution is independent of NAs:

library(dplyr)

df_1 %>% 
  mutate(Person_ID=cumsum(is.na(iv1)) 1) %>% 
  na.omit()
   iv1   iv2 iv3   dv1     dv2   dv3 Person_ID
  <int> <int> <chr> <chr> <int> <int>     <dbl>
1     0     0 0     0         0     0         1
2     1     1 1     0         1     0         1
3     1     0 0     0         0     0         1
4     1     1 0     0         0     1         2
5     1     1 0     0         0     1         2
6     1     0 0     0         1     1         2
7     0     0 1     1         1     0         2
8     0     0 1     0         0     0         3
9     1     1 0     1         0     0         3

Another way could be:

library(tidyverse)

df_1 %>% 
  mutate(Person_ID = coalesce(iv3, dv1),
         Person_ID = ifelse(str_detect(Person_ID, "Person"), parse_number(Person_ID), NA)) %>% 
  fill(Person_ID, .direction = "up") %>% 
  na.omit()
  • Related