Copying information over when there are multiple datapoints in r-CodePudding

I have a data cleaning question. The data collection happened three times and sometimes the data entry was incorrect. Therefore, if the students had their data collected more than one time, the second data point needs to be copied over.

Here is my dataset looks like:

df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,  4,4, 5),
                 text = c("female","male","male", "female","female","female", "male","female","male", "female", "female"),
                 time = c("first","second","third", "first","second","third", "first","second","second", "third", "first"))
            
> df
   id   text   time
1   1 female  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3   male  first
8   3 female second
9   4   male second
10  4 female  third
11  5 female  first

So id 1,3, and 4 have incorrect gender information. I need to copy the second data point when there are multiple/different input about the gender variable. If there is only one datapoint, that should stay in the dataset.

The desired output would be

> df1
   id   text   time
1   1   male  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3 female  first
8   3 female second
9   4   male second
10  4   male  third
11  5 female  first

Any ideas? Thanks!

CodePudding user response：

With {dplyr} we can use the following approach:

we group_by(id)
check in ifelse if there is an element in text when time == "second" and to do this we use length
if thats the case use text[time == "second"] otherwise use text

I just wonder, what should happen if you have three data entires and first and second are the same and third is different. Then the approach above won't work.

Also what should happen if first is "male", second is "female" and third is again "male". Which one should be taken?

The approach below just uses second if its available and ignores the rest.

library(dplyr)


df %>% 
  group_by(id) %>% 
  mutate(text = ifelse(length(text[time == "second"]) > 0,
                       text[time == "second"],
                       text))

#> # A tibble: 11 × 3
#> # Groups:   id [5]
#>       id text   time  
#>    <dbl> <chr>  <chr> 
#>  1     1 male   first 
#>  2     1 male   second
#>  3     1 male   third 
#>  4     2 female first 
#>  5     2 female second
#>  6     2 female third 
#>  7     3 female first 
#>  8     3 female second
#>  9     4 male   second
#> 10     4 male   third 
#> 11     5 female first

^{Created on 2022-09-15 by the reprex package (v0.3.0)}

CodePudding user response：

Just another approach for the fun of it;

library(dplyr)

df %>% 
  filter(time =="second") %>% 
  select(-time) %>% 
  full_join(df, ., by ="id", suffix = c("_old", "")) %>% 
  mutate(text = coalesce(text, text_old)) %>% 
  select(names(df))

#>       id text   time  
#>  1     1 male   first 
#>  2     1 male   second
#>  3     1 male   third 
#>  4     2 female first 
#>  5     2 female second
#>  6     2 female third 
#>  7     3 female first 
#>  8     3 female second
#>  9     4 male   second
#> 10     4 male   third 
#> 11     5 female first

CodePudding user response：

We could use match

library(dplyr)
df %>% 
  group_by(id) %>%
  mutate(text = text[match("second", time, nomatch = 1)]) %>%
  ungroup

-output

# A tibble: 11 × 3
      id text   time  
   <dbl> <chr>  <chr> 
 1     1 male   first 
 2     1 male   second
 3     1 male   third 
 4     2 female first 
 5     2 female second
 6     2 female third 
 7     3 female first 
 8     3 female second
 9     4 male   second
10     4 male   third 
11     5 female first

Or using coalesce

df %>% 
  group_by(id) %>%
  mutate(text = coalesce(text[match("second", time)], text)) %>%
  ungroup

-output

# A tibble: 11 × 3
      id text   time  
   <dbl> <chr>  <chr> 
 1     1 male   first 
 2     1 male   second
 3     1 male   third 
 4     2 female first 
 5     2 female second
 6     2 female third 
 7     3 female first 
 8     3 female second
 9     4 male   second
10     4 male   third 
11     5 female first