I have a dataset with a column of names. I would like to drop rows with the lesser "P" value, if there exists one with a higher value. For example, in the dataset below, I would like to drop the row ID's 3 and 5 since there exists a 'Texas P5' and a 'North Dakota P9.' What is the best way to do this? Thanks in advance!
ID | Name | Score |
---|---|---|
1 | Minnesota P2 | 342 |
2 | Vermont P7 | 342 |
3 | Texas P4 | 65 |
4 | New Mexico | 643 |
5 | North Dakota P8 | 78 |
6 | North Dakota P9 | 245 |
7 | Texas P5 | 856 |
8 | Minnesota LP | 342 |
CodePudding user response:
Here is a base R way. Use ave
to split the data by Name
excluding the numbers, and check which group element is equal to its greatest element. ave
returns a vector of the same class as its input, in this case character. So coerce to logical and subset the original data frame.
x<-"
ID Name Score
1 'Minnesota P2' 342
2 'Vermont P7' 342
3 'Texas P4' 65
4 'New Mexico' 643
5 'North Dakota P8' 78
6 'North Dakota P9' 245
7 'Texas P5' 856
8 'Minnesota LP' 342"
df1 <- read.table(textConnection(x), header = TRUE)
i <- with(df1, ave(Name, sub("\\d ", "", Name), FUN = \(x){
x == tail(sort(x), 1)
}))
df1[as.logical(i),]
#> ID Name Score
#> 1 1 Minnesota P2 342
#> 2 2 Vermont P7 342
#> 4 4 New Mexico 643
#> 6 6 North Dakota P9 245
#> 7 7 Texas P5 856
#> 8 8 Minnesota LP 342
Created on 2022-07-06 by the reprex package (v2.0.1)
CodePudding user response:
Another base option:
dat <- cbind(dat, strcapture("(.*[^ ]) *P([0-9] )$", dat$Name, list(Name2 = "", P = 0L)))
isna <- is.na(dat$P)
dat$P[isna] <- 0L; dat$Name2[isna] <- dat$Name[isna]
dat
# ID Name Score Name2 P
# 1 1 Minnesota P2 342 Minnesota 2
# 2 2 Vermont P7 342 Vermont 7
# 3 3 Texas P4 65 Texas 4
# 4 4 New Mexico 643 New Mexico 0
# 5 5 North Dakota P8 78 North Dakota 8
# 6 6 North Dakota P9 245 North Dakota 9
# 7 7 Texas P5 856 Texas 5
# 8 8 Minnesota LP 342 Minnesota LP 0
dat[ave(dat$P, dat$Name2, FUN = function(z) seq_along(z) == which.max(z)) > 0,]
# ID Name Score Name2 P
# 1 1 Minnesota P2 342 Minnesota 2
# 2 2 Vermont P7 342 Vermont 7
# 4 4 New Mexico 643 New Mexico 0
# 6 6 North Dakota P9 245 North Dakota 9
# 7 7 Texas P5 856 Texas 5
# 8 8 Minnesota LP 342 Minnesota LP 0
CodePudding user response:
Try this solution, which presupposes that the Name
parts by which the Name
values differ are always 2 characters long, preceded by whitespace, and positioned at the end of the string:
library(dplyr)
df %>%
# create a dummy var without the variable part in `Name`:
mutate(Names_dum = sub("\\s.{2}$","",Name)) %>%
# for each `Names_dum`...:
group_by(Names_dum) %>%
# now filter for the maximum value:
filter(Score == max(Score)) %>%
ungroup() %>%
# remove the dummy:
select(-Names_dum)
# A tibble: 6 × 2
Name Score
<chr> <dbl>
1 Minnesota P2 342
2 Vermont P7 342
3 New Mexico 643
4 North Dakota P9 245
5 Texas P5 856
6 Minnesota LP 342
Data:
df <- data.frame(Name = c("Minnesota P2","Vermont P7", "Texas P4", "New Mexico", "North Dakota P8", "North Dakota P9", "Texas P5", "Minnesota LP"),
Score = c(342,342,65,643,78,245,856,342))
CodePudding user response:
Using a data.table
approach:
library(data.table)
dt[,.(Name, Score, Score1 = max(Score)), .(gsub(" [A-Z0-9] $", "", Name))][
Score >= Score1, .(Name, Score)]
#> Name Score
#> 1: Minnesota P2 342
#> 2: Minnesota LP 342
#> 3: Vermont P7 342
#> 4: Texas P5 856
#> 5: New Mexico 643
#> 6: North Dakota P9 245