Row 1 and row 4 have the same information. The only difference is the column they appear under has been flipped.
I already know Yuma County and Cheyenne County are neighbors from row 1. I don't need this information reiterated in row 4.
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
4 Cheyenne County, KS 20023 Yuma County, CO 8125
5 Cheyenne County, KS 20023 Dundy County, NE 31057
I don't mind that the counties appear more than once, I only care that the overall information in each row be different from the previous. I want to keep row 1 and delete row 4, so that the final looks like this
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
How can I delete rows with duplicate information in the dataset?
CodePudding user response:
You could also do:
idx <- duplicated(t(apply(CountyList[c('fipscounty', 'fipsneighbor')], 1, sort)))
CountyList[!idx, ]
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
CodePudding user response:
Here's another possible base R option:
df[!duplicated(t(apply(df, 1, sort))),]
Output
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
Data
df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO",
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS",
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO",
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L,
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))
CodePudding user response:
We can use interaction
to generate unique factors after finding the name with "smaller" (i.e. first in alphabet) name as well as "larger" name. Then we can filter the data.frame
based on that:
CountyList <- read.table(text="countyname fipscounty neighborname fipsneighbor
1 'Yuma County, CO' 8125 'Cheyenne County, KS' 20023
2 'Yuma County, CO' 8125 'Chase County, NE' 31029
3 'Cheyenne County, KS' 20023 'Kit Carson County, CO' 8063
4 'Cheyenne County, KS' 20023 'Yuma County, CO' 8125
5 'Cheyenne County, KS' 20023 'Dundy County, NE' 31057")
fname <- pmin(CountyList$countyname,CountyList$neighborname) #Get first name
lname <- pmax(CountyList$countyname,CountyList$neighborname) #Get last names
duplicate.key <- as.numeric(interaction(fname,lname)) # Create factors from interaction and convert to numeric
CountyList[match(unique(duplicate.key),duplicate.key),] # Only keep first occurence
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
CodePudding user response:
Here's a tidyverse
approach.
First unite
all columns together into new_col
(i.e. to paste all columns together). Then split the new_col
back to it's individual pieces and sort
them. Save this into new_col2
. Next we only keep the distinct
rows of new_col2
. Finally removes the newly created columns.
library(tidyverse)
df %>%
unite("new_col", everything(), sep = "_", remove = F) %>%
rowwise() %>%
mutate(new_col2 = paste(sort(str_split(new_col, "_", simplify = T)), collapse = "")) %>%
ungroup() %>%
distinct(new_col2, .keep_all = T) %>%
select(-starts_with("new_col"))
# A tibble: 4 × 4
countyname fipscounty neighborname fipsneighbor
<chr> <int> <chr> <int>
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
4 Cheyenne County, KS 20023 Dundy County, NE 31057
Data
df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO",
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS",
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO",
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L,
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))