I have a data cleaning question. The data collection happened three times and sometimes the data entry was incorrect. Therefore, if the students had their data collected more than one time, the second
data point needs to be copied over.
Here is my dataset looks like:
df <- data.frame(id = c(1,1,1, 2,2,2, 3,3, 4,4, 5),
text = c("female","male","male", "female","female","female", "male","female","male", "female", "female"),
time = c("first","second","third", "first","second","third", "first","second","second", "third", "first"))
> df
id text time
1 1 female first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 male first
8 3 female second
9 4 male second
10 4 female third
11 5 female first
So id
1,3, and 4 have incorrect gender information. I need to copy the second
data point when there are multiple/different input about the gender
variable. If there is only one datapoint, that should stay in the dataset.
The desired output would be
> df1
id text time
1 1 male first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 female first
8 3 female second
9 4 male second
10 4 male third
11 5 female first
Any ideas? Thanks!
CodePudding user response:
With {dplyr} we can use the following approach:
- we
group_by(id)
- check in
ifelse
if there is an element intext
whentime == "second"
and to do this we uselength
- if thats the case use
text[time == "second"]
otherwise usetext
I just wonder, what should happen if you have three data entires and first
and second
are the same and third
is different. Then the approach above won't work.
Also what should happen if first
is "male"
, second
is "female"
and third
is again "male"
. Which one should be taken?
The approach below just uses second
if its available and ignores the rest.
library(dplyr)
df %>%
group_by(id) %>%
mutate(text = ifelse(length(text[time == "second"]) > 0,
text[time == "second"],
text))
#> # A tibble: 11 × 3
#> # Groups: id [5]
#> id text time
#> <dbl> <chr> <chr>
#> 1 1 male first
#> 2 1 male second
#> 3 1 male third
#> 4 2 female first
#> 5 2 female second
#> 6 2 female third
#> 7 3 female first
#> 8 3 female second
#> 9 4 male second
#> 10 4 male third
#> 11 5 female first
Created on 2022-09-15 by the reprex package (v0.3.0)
CodePudding user response:
Just another approach for the fun of it;
library(dplyr)
df %>%
filter(time =="second") %>%
select(-time) %>%
full_join(df, ., by ="id", suffix = c("_old", "")) %>%
mutate(text = coalesce(text, text_old)) %>%
select(names(df))
#> id text time
#> 1 1 male first
#> 2 1 male second
#> 3 1 male third
#> 4 2 female first
#> 5 2 female second
#> 6 2 female third
#> 7 3 female first
#> 8 3 female second
#> 9 4 male second
#> 10 4 male third
#> 11 5 female first
CodePudding user response:
We could use match
library(dplyr)
df %>%
group_by(id) %>%
mutate(text = text[match("second", time, nomatch = 1)]) %>%
ungroup
-output
# A tibble: 11 × 3
id text time
<dbl> <chr> <chr>
1 1 male first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 female first
8 3 female second
9 4 male second
10 4 male third
11 5 female first
Or using coalesce
df %>%
group_by(id) %>%
mutate(text = coalesce(text[match("second", time)], text)) %>%
ungroup
-output
# A tibble: 11 × 3
id text time
<dbl> <chr> <chr>
1 1 male first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 female first
8 3 female second
9 4 male second
10 4 male third
11 5 female first