Home > Enterprise >  How to only select rows that are duplicated in a column in a dataframe
How to only select rows that are duplicated in a column in a dataframe

Time:03-14

I have joined two dataframes together and I am trying to select only the 'Branch Codes' that are duplicated.

I want to join the datasets 'BranchData' and 'BranchCode' so that any branch codes that are common to both datasets are included as well as those are not common to both datasets.

However, the last line of the code below does not seem to work!

BranchData$'Branch Code' <
as.numeric(BranchData$'Branch Code')
BranchCalls$'Branch Code' <- as.numeric(BranchCalls$'Branch Code')
BranchData <- na.omit(BranchData)
merged <- full_join(BranchData,BranchCalls)
merged <- merged %>% group_by(merged$`Branch Code`) %>% filter(n() >= 2)

Minimal Reproducible Example:

    structure(list(`Branch Code` = c(401801, 436801, 403801, 164801, 
198801), `Location Type` = c("Urban", "Urban", "Urban Deprived", 
"Rural", "Urban"), Type = c("MAIN", "MAIN", "MAIN", "MAIN", "LM"
), Status = c("Open", "Open", "Open", "Open", "Open"), Segment = c("Agency", 
"Agency", "Agency", "Agency", "Agency"), `Multiple (partner that owns multiple branches)` = c("Multiple 11", 
"Multiple 11", "Multiple 12", "Multiple 13", "Multiple 13"), 
    RetailType = c("Books_Stationery", "Books_Stationery", "Convenience", 
    "Convenience", "Convenience"), `Volume of transactions` = c(2238, 
    1514, 1346, 1338, 625), `Open hours` = c(47.75, 50.2500000000001, 
    46.5, 48.25, 114.25), `X Pos` = c(394169, 393488, 394434, 
    392153, 393094), `Y Pos` = c(806326, 805877, 804347, 796902, 
    802789), Urbanity = c("Major Centre", "Major Centre", "High Density", 
    "Low Density", "Low Density"), `Case Reference Number` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), `Created On` = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), `Branch Type` = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), L1 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), L2 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), L3 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_), L4 = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), `Case Type` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
), na.action = structure(c(`3` = 3L, `4` = 4L, `5` = 5L, `6` = 6L, 
`7` = 7L, `8` = 8L, `9` = 9L, `11` = 11L, `13` = 13L, `16` = 16L, 
`17` = 17L, `18` = 18L, `20` = 20L, `21` = 21L, `22` = 22L, `23` = 23L, 
`26` = 26L, `27` = 27L, `28` = 28L, `29` = 29L, `31` = 31L, `32` = 32L, 
`33` = 33L, `34` = 34L, `35` = 35L, `36` = 36L, `37` = 37L, `39` = 39L, 
`40` = 40L, `41` = 41L, `42` = 42L, `43` = 43L, `44` = 44L, `45` = 45L, 
`46` = 46L, `47` = 47L, `48` = 48L, `49` = 49L, `51` = 51L, `52` = 52L, 
`54` = 54L, `55` = 55L, `57` = 57L, `58` = 58L, `59` = 59L, `60` = 60L, 
`61` = 61L, `62` = 62L, `63` = 63L, `65` = 65L, `67` = 67L, `68` = 68L, 
`69` = 69L, `70` = 70L, `71` = 71L, `72` = 72L, `74` = 74L, `75` = 75L, 
`76` = 76L, `77` = 77L, `78` = 78L, `80` = 80L, `81` = 81L, `82` = 82L, 
`83` = 83L, `84` = 84L, `86` = 86L, `87` = 87L, `88` = 88L, `89` = 89L, 
`91` = 91L, `92` = 92L, `93` = 93L, `96` = 96L, `97` = 97L, `98` = 98L, 
`99` = 99L, `100` = 100L, `101` = 101L, `103` = 103L, `106` = 106L, 
`107` = 107L, `108` = 108L, `109` = 109L, `110` = 110L, `111` = 111L, 
`112` = 112L, `113` = 113L, `114` = 114L, `115` = 115L, `116` = 116L, 
`117` = 117L, `118` = 118L, `119` = 119L, `120` = 120L, `121` = 121L, 
`122` = 122L, `123` = 123L, `124` = 124L, `126` = 126L, `127` = 127L, 
`129` = 129L, `130` = 130L, `131` = 131L, `132` = 132L, `133` = 133L, 
`134` = 134L, `135` = 135L, `136` = 136L, `137` = 137L, `139` = 139L, 
`140` = 140L, `141` = 141L, `142` = 142L, `143` = 143L, `144` = 144L, 
`145` = 145L, `146` = 146L, `147` = 147L, `148` = 148L, `149` = 149L, 
`150` = 150L, `151` = 151L, `152` = 152L, `153` = 153L, `155` = 155L, 
`156` = 156L, `157` = 157L, `160` = 160L, `161` = 161L, `162` = 162L, 
`163` = 163L, `165` = 165L, `166` = 166L, `167` = 167L, `168` = 168L, 
`169` = 169L, `174` = 174L, `175` = 175L, `176` = 176L, `177` = 177L, 
`178` = 178L, `179` = 179L, `180` = 180L, `182` = 182L, `183` = 183L, 
`185` = 185L, `186` = 186L, `188` = 188L, `189` = 189L, `190` = 190L, 
`191` = 191L, `192` = 192L, `193` = 193L, `194` = 194L, `195` = 195L, 
`196` = 196L, `197` = 197L, `198` = 198L, `199` = 199L, `200` = 200L, 
`201` = 201L, `203` = 203L, `204` = 204L, `205` = 205L, `206` = 206L, 
`207` = 207L, `209` = 209L, `210` = 210L, `211` = 211L, `212` = 212L, 
`213` = 213L, `214` = 214L, `215` = 215L, `216` = 216L, `217` = 217L, 
`218` = 218L, `219` = 219L, `220` = 220L, `221` = 221L, `222` = 222L, 
`223` = 223L, `224` = 224L, `226` = 226L, `227` = 227L, `228` = 228L, 
`229` = 229L, `230` = 230L, `231` = 231L, `232` = 232L, `233` = 233L, 
`234` = 234L, `236` = 236L, `237` = 237L, `238` = 238L, `239` = 239L, 
`240` = 240L, `241` = 241L, `242` = 242L, `243` = 243L, `244` = 244L, 
`245` = 245L, `247` = 247L, `248` = 248L, `249` = 249L, `250` = 250L, 
`251` = 251L, `252` = 252L, `253` = 253L, `254` = 254L, `255` = 255L, 
`256` = 256L, `257` = 257L, `258` = 258L, `259` = 259L, `260` = 260L, 
`261` = 261L, `262` = 262L, `263` = 263L, `264` = 264L, `265` = 265L, 
`266` = 266L, `267` = 267L, `268` = 268L, `269` = 269L, `270` = 270L, 
`271` = 271L, `272` = 272L, `273` = 273L, `274` = 274L, `276` = 276L, 
`278` = 278L, `280` = 280L, `281` = 281L, `282` = 282L, `283` = 283L, 
`284` = 284L, `285` = 285L, `286` = 286L, `288` = 288L, `289` = 289L, 
`291` = 291L, `292` = 292L, `293` = 293L, `294` = 294L, `296` = 296L, 
`297` = 297L, `298` = 298L, `299` = 299L, `300` = 300L, `301` = 301L, 
`304` = 304L, `305` = 305L, `306` = 306L, `307` = 307L, `308` = 308L, 
`311` = 311L, `312` = 312L, `313` = 313L, `316` = 316L, `319` = 319L, 
`321` = 321L, `322` = 322L, `323` = 323L, `324` = 324L, `325` = 325L, 
`326` = 326L, `327` = 327L, `328` = 328L, `329` = 329L, `330` = 330L, 
`331` = 331L, `332` = 332L, `333` = 333L, `335` = 335L, `337` = 337L, 
`338` = 338L, `339` = 339L, `340` = 340L, `341` = 341L, `342` = 342L, 
`343` = 343L, `344` = 344L, `345` = 345L, `346` = 346L, `347` = 347L, 
`348` = 348L, `349` = 349L, `350` = 350L, `351` = 351L, `352` = 352L, 
`353` = 353L, `354` = 354L, `355` = 355L, `356` = 356L, `357` = 357L, 
`359` = 359L, `360` = 360L, `361` = 361L, `362` = 362L, `363` = 363L, 
`365` = 365L, `366` = 366L, `367` = 367L, `368` = 368L, `370` = 370L, 
`371` = 371L, `372` = 372L, `373` = 373L, `375` = 375L, `376` = 376L, 
`378` = 378L, `379` = 379L, `380` = 380L, `381` = 381L, `382` = 382L, 
`384` = 384L, `385` = 385L, `387` = 387L, `388` = 388L, `389` = 389L, 
`390` = 390L, `391` = 391L, `392` = 392L, `393` = 393L, `395` = 395L, 
`396` = 396L, `397` = 397L, `398` = 398L, `399` = 399L, `400` = 400L, 
`401` = 401L, `403` = 403L, `404` = 404L, `405` = 405L, `409` = 409L, 
`412` = 412L, `413` = 413L, `414` = 414L, `415` = 415L, `416` = 416L, 
`418` = 418L, `419` = 419L, `420` = 420L, `421` = 421L, `422` = 422L, 
`423` = 423L, `426` = 426L, `427` = 427L, `428` = 428L, `429` = 429L, 
`432` = 432L, `433` = 433L, `435` = 435L, `436` = 436L, `437` = 437L, 
`438` = 438L, `440` = 440L, `441` = 441L, `442` = 442L, `443` = 443L, 

I would be so grateful if anybody could give me a helping hand!

Thank you so much!

CodePudding user response:

You can do it using table:

merged %>% filter(table(`Branch Code`)[`Branch Code`] > 1)

or using add_count:

merged %>% add_count(`Branch Code`) %>% filter(n > 1)

CodePudding user response:

I created a small sample data:

merged <- data.frame(branch_code = c("401801", "436801", "401801"),
                     location_type = c("Urban", "Urban", "Rural"))

  branch_code location_type
1      401801         Urban
2      436801         Urban
3      401801         Rural

You can use this code:

merged %>%
  group_by(branch_code) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  select(-n)

Output:

# A tibble: 2 × 2
# Groups:   branch_code [1]
  branch_code location_type     
  <chr>       <chr>         
1 401801      Urban             
2 401801      Rural             
  • Related