Home > Enterprise >  How to drop rows in R with almost the same column value?
How to drop rows in R with almost the same column value?

Time:07-07

I have a dataset with a column of names. I would like to drop rows with the lesser "P" value, if there exists one with a higher value. For example, in the dataset below, I would like to drop the row ID's 3 and 5 since there exists a 'Texas P5' and a 'North Dakota P9.' What is the best way to do this? Thanks in advance!

ID Name Score
1 Minnesota P2 342
2 Vermont P7 342
3 Texas P4 65
4 New Mexico 643
5 North Dakota P8 78
6 North Dakota P9 245
7 Texas P5 856
8 Minnesota LP 342

CodePudding user response:

Here is a base R way. Use ave to split the data by Name excluding the numbers, and check which group element is equal to its greatest element. ave returns a vector of the same class as its input, in this case character. So coerce to logical and subset the original data frame.

x<-"
ID  Name    Score
1   'Minnesota P2'  342
2   'Vermont P7'    342
3   'Texas P4'  65
4   'New Mexico'    643
5   'North Dakota P8'   78
6   'North Dakota P9'   245
7   'Texas P5'  856
8   'Minnesota LP'  342"
df1 <- read.table(textConnection(x), header = TRUE)


i <- with(df1, ave(Name, sub("\\d ", "", Name), FUN = \(x){
  x == tail(sort(x), 1)
}))
df1[as.logical(i),]
#>   ID            Name Score
#> 1  1    Minnesota P2   342
#> 2  2      Vermont P7   342
#> 4  4      New Mexico   643
#> 6  6 North Dakota P9   245
#> 7  7        Texas P5   856
#> 8  8    Minnesota LP   342

Created on 2022-07-06 by the reprex package (v2.0.1)

CodePudding user response:

Another base option:

dat <- cbind(dat, strcapture("(.*[^ ]) *P([0-9] )$", dat$Name, list(Name2 = "", P = 0L)))
isna <- is.na(dat$P)
dat$P[isna] <- 0L; dat$Name2[isna] <- dat$Name[isna]
dat
#   ID            Name Score        Name2 P
# 1  1    Minnesota P2   342    Minnesota 2
# 2  2      Vermont P7   342      Vermont 7
# 3  3        Texas P4    65        Texas 4
# 4  4      New Mexico   643   New Mexico 0
# 5  5 North Dakota P8    78 North Dakota 8
# 6  6 North Dakota P9   245 North Dakota 9
# 7  7        Texas P5   856        Texas 5
# 8  8    Minnesota LP   342 Minnesota LP 0

dat[ave(dat$P, dat$Name2, FUN = function(z) seq_along(z) == which.max(z)) > 0,]
#   ID            Name Score        Name2 P
# 1  1    Minnesota P2   342    Minnesota 2
# 2  2      Vermont P7   342      Vermont 7
# 4  4      New Mexico   643   New Mexico 0
# 6  6 North Dakota P9   245 North Dakota 9
# 7  7        Texas P5   856        Texas 5
# 8  8    Minnesota LP   342 Minnesota LP 0

CodePudding user response:

Try this solution, which presupposes that the Name parts by which the Namevalues differ are always 2 characters long, preceded by whitespace, and positioned at the end of the string:

library(dplyr)
df %>%
  # create a dummy var without the variable part in `Name`:
  mutate(Names_dum = sub("\\s.{2}$","",Name)) %>%
  # for each `Names_dum`...:
  group_by(Names_dum) %>%
  # now filter for the maximum value:
  filter(Score == max(Score)) %>%
  ungroup() %>%
  # remove the dummy:
  select(-Names_dum)
# A tibble: 6 × 2
  Name            Score
  <chr>           <dbl>
1 Minnesota P2      342
2 Vermont P7        342
3 New Mexico        643
4 North Dakota P9   245
5 Texas P5          856
6 Minnesota LP      342

Data:

df <- data.frame(Name = c("Minnesota P2","Vermont P7", "Texas P4", "New Mexico", "North Dakota P8", "North Dakota P9", "Texas P5", "Minnesota LP"), 
                 Score = c(342,342,65,643,78,245,856,342))

CodePudding user response:

Using a data.table approach:

library(data.table)

dt[,.(Name, Score, Score1 = max(Score)), .(gsub(" [A-Z0-9] $", "", Name))][
  Score >= Score1, .(Name, Score)]

#>               Name Score
#> 1:    Minnesota P2   342
#> 2:    Minnesota LP   342
#> 3:      Vermont P7   342
#> 4:        Texas P5   856
#> 5:      New Mexico   643
#> 6: North Dakota P9   245
  • Related