I have a dataframe that contains personal data like this:
data2 <- data.frame(Name = c("Peter Parker", "Peter Parker", "Peter Parker",
"Peter Parker", "Peter Parker", "Peter Parker", "Peter Parker", "Benny Thompson", "Benny Thompson",
"Benny Thompson", "Benny Thompson", "Benny Thompson", "Mylo Thony"
), Result = c(150L, 155L, 156L, 154L, 158L, 159L, 159L, 130L, 132L, 138L,
140L, 139L, 177L), Date = c("2018-03-03", "2018-03-04", "2018-03-05",
"2018-03-06", "2018-03-07", "2018-03-09", "2018-03-13", "2018-03-03", "2018-03-04", "2018-03-05",
"2018-03-07", "2018-03-09", "2018-03-11"))
> data2
Name Result Date
1 Peter Parker 150 2018-03-03
2 Peter Parker 155 2018-03-04
3 Peter Parker 156 2018-03-05
4 Peter Parker 154 2018-03-06
5 Peter Parker 158 2018-03-07
6 Peter Parker 159 2018-03-09
7 Peter Parker 159 2018-03-13
8 Benny Thompson 130 2018-03-03
9 Benny Thompson 132 2018-03-04
10 Benny Thompson 138 2018-03-05
11 Benny Thompson 140 2018-03-07
12 Benny Thompson 139 2018-03-09
13 Mylo Thony 177 2018-03-11
It has a lot more columns which I omitted for brevity.
I want to remove the excessive entries of everyone with more than 5 entries. In this case Benny has exaclty 5 entries, so he can stay "as is". Mylo does only have one entry, so his entry will eventually be dealt with at a later stage in my code (will be removed). For Peter my preferred solution would be to create 3 sets of 5 entries, with "Peter Parker1", "Peter Parker2" and "Peter Parker3" as names. I don't really care about the exact names, since only the results and eventually the dates are used for some predictions. I want it to look something like this:
Name Result Date
Peter Parker1 150 2018-03-03
Peter Parker1 155 2018-03-04
Peter Parker1 156 2018-03-05
Peter Parker1 154 2018-03-06
Peter Parker1 158 2018-03-07
Peter Parker2 155 2018-03-04
Peter Parker2 156 2018-03-05
Peter Parker2 154 2018-03-06
Peter Parker2 158 2018-03-07
Peter Parker2 159 2018-03-09
Peter Parker3 156 2018-03-05
Peter Parker3 154 2018-03-06
Peter Parker3 158 2018-03-07
Peter Parker3 159 2018-03-09
Peter Parker3 159 2018-03-13
Benny Thompson 130 2018-03-03
Benny Thompson 132 2018-03-04
Benny Thompson 138 2018-03-05
Benny Thompson 140 2018-03-07
Benny Thompson 139 2018-03-09
Mylo Thony 177 2018-03-11
I have no clue how to achieve that. In case this is not possible I just want to remove all excessive entries of Peter and just keep the first 5 so I can make at least good use of those first 5
CodePudding user response:
library(tidyverse)
data2 <- data.frame(Name = c("Peter Parker", "Peter Parker", "Peter Parker",
"Peter Parker", "Peter Parker", "Peter Parker", "Peter Parker", "Benny Thompson", "Benny Thompson",
"Benny Thompson", "Benny Thompson", "Benny Thompson", "Mylo Thony"
), Result = c(150L, 155L, 156L, 154L, 158L, 159L, 159L, 130L, 132L, 138L,
140L, 139L, 177L), Date = c("2018-03-03", "2018-03-04", "2018-03-05",
"2018-03-06", "2018-03-07", "2018-03-09", "2018-03-13", "2018-03-03", "2018-03-04", "2018-03-05",
"2018-03-07", "2018-03-09", "2018-03-11"))
data2 %>%
group_by(Name) %>%
mutate(
n = n(),
group = ceiling(row_number()/5),
Name = if_else(n>5, paste0(Name, group), Name)
) %>%
select(-n,-group)
#> # A tibble: 13 × 3
#> # Groups: Name [4]
#> Name Result Date
#> <chr> <int> <chr>
#> 1 Peter Parker1 150 2018-03-03
#> 2 Peter Parker1 155 2018-03-04
#> 3 Peter Parker1 156 2018-03-05
#> 4 Peter Parker1 154 2018-03-06
#> 5 Peter Parker1 158 2018-03-07
#> 6 Peter Parker2 159 2018-03-09
#> 7 Peter Parker2 159 2018-03-13
#> 8 Benny Thompson 130 2018-03-03
#> 9 Benny Thompson 132 2018-03-04
#> 10 Benny Thompson 138 2018-03-05
#> 11 Benny Thompson 140 2018-03-07
#> 12 Benny Thompson 139 2018-03-09
#> 13 Mylo Thony 177 2018-03-11
Created on 2023-01-06 by the reprex package (v2.0.1)
Is something like this what you're after?