Home > Software engineering >  How to conditionally delete rows with a duplicate column value in R
How to conditionally delete rows with a duplicate column value in R

Time:09-16

I have a tibble which consists of scraped link text from the wayback machine. Some links were put into the wayback machine multiple times so we have duplicate links, but with slightly different endings.

Example:

https://www.bjjcompsystem.com/tournaments/1869/categories/2053148

https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=en

https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=pt-BR

The trouble is that since the URLs in the data frame have slight differences in the ending of their URLs as shown above, I'm not sure how I can get R to delete rows with duplicate URLs.

Is there a way to remove all the rows with duplicate links if those duplicates have slightly different endings?

This is what I've done, but this just splits out a vector with the formatted links. I wanted the formatted links in the original data frame.

library(rvest); library(tidyverse); library(stringr)
library(httr2); library(janitor)

link_text = "https://web.archive.org/web/timemap/json?url=https://www.bjjcompsystem.com/tournaments/1869/categories&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&filter=!statuscode:[45]..&limit=10000&_=1663136483842" %>% 
  request() %>% 
  req_perform() %>% 
  resp_body_json(simplifyVector = TRUE) %>% 
  as_tibble() %>% 
  row_to_names(1)

link_text = link_text[-c(1, 788:790),]
link_text2 = link_text
link_text2 = as.list(substr(link_text2$original, 1, 65)) 
link_text2 = link_text2[!duplicated(link_text2)]
head(link_text2)

CodePudding user response:

Here's a dplyr-based answer! Thanks for the great reproducible example.

  1. We create a column of the first 65 characters in the link with mutate()
  2. We use distinct() to select only those rows with distinct link prefixes
  3. We get rid of the dummy column with select()
link_text_no_duplicates <- link_text %>% 
  mutate(duplicate_check=substr(original, 1, 65)) %>% # creates column of the first 65 characters in each link
  distinct(duplicate_check, .keep_all = TRUE) %>% # selects only those rows with distinct first 65 link characters
  select(!duplicate_check) # removes the dummy column we created to check distinctness

If you want to keep the column with the distinct link prefixes, just remove that last select() line. Not sure if those are the "formatted links" you mentioned.

Hope that helps!

CodePudding user response:

Please find a variant which uses data.table and stringr. This works if we assume that all links have an explicit question mark ? at some point. The code then extracts the root of the links and checks the duplicated rows over that.

library(data.table)
library(stringr)

df = data.table(id_row = 1L:3L,
                id_group = rep(1L, 3),
                link = c("https://www.bjjcompsystem.com/tournaments/1869/categories/2053148",
                         "https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=en",
                         "https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=pt-BR"))
df
#>    id_row id_group
#>     <int>    <int>
#> 1:      1        1
#> 2:      2        1
#> 3:      3        1
#>                                                                              link
#>                                                                            <char>
#> 1:              https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 2:    https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=en
#> 3: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=pt-BR
# remove everything after the "?"
df[ , root_link := str_remove(link, "\\?.*$")]
df
#>    id_row id_group
#>     <int>    <int>
#> 1:      1        1
#> 2:      2        1
#> 3:      3        1
#>                                                                              link
#>                                                                            <char>
#> 1:              https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 2:    https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=en
#> 3: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=pt-BR
#>                                                            root_link
#>                                                               <char>
#> 1: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 2: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 3: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148

# check duplicates over "root_link"
# here it defines an explicit variable to check for the duplications
# you can use unique() to directly remove the duplicates
df[ , check_unique := duplicated(root_link)]
df
#>    id_row id_group
#>     <int>    <int>
#> 1:      1        1
#> 2:      2        1
#> 3:      3        1
#>                                                                              link
#>                                                                            <char>
#> 1:              https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 2:    https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=en
#> 3: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148?locale=pt-BR
#>                                                            root_link
#>                                                               <char>
#> 1: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 2: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#> 3: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#>    check_unique
#>          <lgcl>
#> 1:        FALSE
#> 2:         TRUE
#> 3:         TRUE

# remove duplicates
df_unique = df[ check_unique == FALSE ]
df_unique
#>    id_row id_group
#>     <int>    <int>
#> 1:      1        1
#>                                                                 link
#>                                                               <char>
#> 1: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#>                                                            root_link
#>                                                               <char>
#> 1: https://www.bjjcompsystem.com/tournaments/1869/categories/2053148
#>    check_unique
#>          <lgcl>
#> 1:        FALSE

Created on 2022-09-15 by the reprex package (v2.0.1)

  • Related