How to clean the data by removing the duplicates in the column "Collection"?

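library(curl)   # curl()
library(XML)    # readHTMLTable()
library(dplyr)  # %>%, select(), mutate()

theurl <- "https://cryptoslam.io/#sales-rankings-24h"
url <- curl(theurl, "rb")
urldata <- readLines(url, warn = FALSE)
# Keep text columns as character vectors rather than factors
data <- readHTMLTable(urldata, stringsAsFactors = FALSE)
close(url)
data.2 <- data.frame(Reduce(rbind, data[1]))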

data.3 <- data.2 %>% dplyr::select(Collection, Sales, Change..24h.) %>%
  head(10) %>% mutate(Sales.numeric = as.numeric(gsub('[$,]', '', Sales)))

The strings in the "Collection" column are duplicated:

> data.3$Collection
 [1] "Bored Ape Yacht ClubBored Ape YC"          
 [2] "Mutant Ape Yacht ClubMutant Ape Yacht Club"
 [3] "CryptoPunksCryptoPunks"                    
 [4] "CloneXCloneX"                              
 [5] "MeebitsMeebits"                            
 [6] "Bored Ape Kennel ClubBored Ape Kennel Club"
 [7] "CrypToadzCrypToadz"                        
 [8] "AzukiAzuki"                                
 [9] "World Of WomenWorld Of Women"              
[10] "CrabadaCrabada"

Is there any way to remove such duplicates?

CodePudding user response:

One way to solve this is to take the names directly from the website:

library(rvest)
library(dplyr)
# Scrape the abbreviated collection names straight from the page
name <- theurl %>%
  read_html() %>%
  html_nodes('.summary-sales-table__column-product-abbreviation') %>%
  html_text()
# Keep the first 250 names, since data.2 has only 250 entries
name <- name[1:250]
data.2$Collection <- name
              Collection       Sales Change..24h. Sales.numeric
1           Bored Ape YC $15,609,329        0.72%      15609329
2  Mutant Ape Yacht Club $13,337,117      438.65%      13337117
3            CryptoPunks  $6,188,758        9.88%       6188758
4                 CloneX  $5,977,297      397.96%       5977297
5                Meebits  $5,139,169       35.32%       5139169
6  Bored Ape Kennel Club  $3,052,526      392.60%       3052526
7                  Azuki  $2,736,697       63.56%       2736697
8         World Of Women  $2,671,328       36.28%       2671328
9                Crabada  $2,665,660       19.88%       2665660
10           RTFKT MNLTH  $2,182,638       48.33%       2182638
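
If scraping the names a second time is not an option, a regex-only fallback may be enough for most rows. The sketch below is not part of the original answer; the helper name dedupe_doubled is made up for illustration. It assumes a value is the same string written twice back to back, so it collapses "CryptoPunksCryptoPunks" but leaves "Bored Ape Yacht ClubBored Ape YC" unchanged, because there the second half is an abbreviation rather than an exact copy.

```r
# Hypothetical helper: collapse a string made of two identical halves.
# "^(.+)\\1$" matches only when the whole string is some text followed
# by an exact copy of that same text; other strings are returned as-is.
dedupe_doubled <- function(x) {
  sub("^(.+)\\1$", "\\1", x, perl = TRUE)
}

# Sample values copied from the question's output
collection <- c("Bored Ape Yacht ClubBored Ape YC",
                "CryptoPunksCryptoPunks",
                "World Of WomenWorld Of Women",
                "AzukiAzuki")

cleaned <- dedupe_doubled(collection)
print(cleaned)
```

On the real data this would be applied as data.3$Collection <- dedupe_doubled(data.3$Collection). Rows where the full name and the abbreviation differ still need the scraping approach above.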