R: delete duplicate rows that have been flipped-CodePudding

Row 1 and row 4 have the same information. The only difference is the column they appear under has been flipped.

I already know Yuma County and Cheyenne County are neighbors from row 1. I don't need this information reiterated in row 4.

           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
4 Cheyenne County, KS      20023       Yuma County, CO         8125
5 Cheyenne County, KS      20023      Dundy County, NE        31057

I don't mind that the counties appear more than once, I only care that the overall information in each row be different from the previous. I want to keep row 1 and delete row 4, so that the final looks like this

           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057

How can I delete rows with duplicate information in the dataset?

CodePudding user response：

You could also do:

idx <- duplicated(t(apply(CountyList[c('fipscounty', 'fipsneighbor')], 1, sort)))
CountyList[!idx, ]

          countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057

CodePudding user response：

Here's another possible base R option:

df[!duplicated(t(apply(df, 1, sort))),]

Output

         countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057

Data

df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO", 
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS", 
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO", 
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L, 
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))

CodePudding user response：

We can use interaction to generate unique factors after finding the name with "smaller" (i.e. first in alphabet) name as well as "larger" name. Then we can filter the data.frame based on that:

CountyList <- read.table(text="countyname fipscounty          neighborname fipsneighbor
1     'Yuma County, CO'       8125   'Cheyenne County, KS'        20023
2     'Yuma County, CO'       8125      'Chase County, NE'        31029
3 'Cheyenne County, KS'      20023 'Kit Carson County, CO'         8063
4 'Cheyenne County, KS'      20023       'Yuma County, CO'         8125
5 'Cheyenne County, KS'      20023      'Dundy County, NE'        31057")


fname <- pmin(CountyList$countyname,CountyList$neighborname) #Get first name
lname <- pmax(CountyList$countyname,CountyList$neighborname) #Get last names

duplicate.key <- as.numeric(interaction(fname,lname)) # Create factors from interaction and convert to numeric

CountyList[match(unique(duplicate.key),duplicate.key),] # Only keep first occurence


           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057

CodePudding user response：

Here's a tidyverse approach.

First unite all columns together into new_col (i.e. to paste all columns together). Then split the new_col back to it's individual pieces and sort them. Save this into new_col2. Next we only keep the distinct rows of new_col2. Finally removes the newly created columns.

library(tidyverse)

df %>% 
  unite("new_col", everything(), sep = "_", remove = F) %>% 
  rowwise() %>% 
  mutate(new_col2 = paste(sort(str_split(new_col, "_", simplify = T)), collapse = "")) %>% 
  ungroup() %>% 
  distinct(new_col2, .keep_all = T) %>% 
  select(-starts_with("new_col"))

# A tibble: 4 × 4
  countyname          fipscounty neighborname          fipsneighbor
  <chr>                    <int> <chr>                        <int>
1 Yuma County, CO           8125 Cheyenne County, KS          20023
2 Yuma County, CO           8125 Chase County, NE             31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
4 Cheyenne County, KS      20023 Dundy County, NE             31057

Data

df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO", 
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS", 
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO", 
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L, 
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))