Home > OS >  Combining column contents with duplicate information in some rows
Combining column contents with duplicate information in some rows

Time:12-15

I have been wrangling some data accessed from various databases and now have two columns that include duplicate information in some rows, but not in others. I would like to combine these two columns, but only keep one set of information if it is duplicate.

Here is an example of what my data looks like:

species <- c('taxon1', 'taxon2', 'taxon3', 'taxon4', 'taxon 5', 'taxon6','taxon7','taxon8')
continentmax <- c(NA,'North America','Oceania','Europe, North America, and Oceania',NA,NA,'Europe',NA)
continentmin <- c('South America','North America',NA,'Europe, North America, and Oceania',NA,NA,'Europe','Asia')

df <- data.frame(species, continentmax, continentmin)

  species                       continentmax                       continentmin
1  taxon1                               <NA>                      South America
2  taxon2                      North America                      North America
3  taxon3                            Oceania                               <NA>
4  taxon4 Europe, North America, and Oceania Europe, North America, and Oceania
5 taxon 5                               <NA>                               <NA>
6  taxon6                               <NA>                               <NA>
7  taxon7                             Europe                             Europe
8  taxon8                               <NA>                               Asia

For some rows both are NA, some have duplicate information, and some have information in only one column. I would like to have a combined continent column as output, like so:

 species                          continent
1  taxon1                      South America
2  taxon2                      North America
3  taxon3                            Oceania
4  taxon4 Europe, North America, and Oceania
5 taxon 5                               <NA>
6  taxon6                               <NA>
7  taxon7                             Europe
8  taxon8                               Asia

I have tried df$continent <- paste(df$continentmax, df$continentmin, collapse = ',') but the resulting column has each row filled with all continents. When I used df$continent <- paste(unique(c(df$continentmax[1], df$continentmin[1])), collapse = ',') I get approximately the desired result but only for the first row, and the NA is also pasted as text. I have over 2000 rows so this method is not feasible.

I have also tried using sapply and a tidy way using mutate to no avail. The closest I've come is by using df$continent <- do.call(paste, c(df[2:3], sep = ",")) with the result:

  species                                                             continent
1  taxon1                                                      NA,South America
2  taxon2                                           North America,North America
3  taxon3                                                            Oceania,NA
4  taxon4 Europe, North America, and Oceania,Europe, North America, and Oceania
5 taxon 5                                                                 NA,NA
6  taxon6                                                                 NA,NA
7  taxon7                                                         Europe,Europe
8  taxon8                                                               NA,Asia

The desired information is present in each row but the NAs are included and duplicate data is listed twice. I have considered writing a function to go row by row but haven't been able to get the syntax right.

Any ideas on how to approach this? I haven't found any answer that works for me in the many entries on combining column contents.

Any help is very much appreciated!

edit: Out of curiosity, does anyone have a data.table or base R solution ?

CodePudding user response:

How about this...

library(dplyr)

df %>%
  mutate(continent = case_when(continentmax == continentmin ~ continentmax,
                           is.na(continentmax) & !is.na(continentmin) ~ continentmin,
                           !is.na(continentmax) & is.na(continentmin) ~ continentmax,
                           is.na(continentmax) & is.na(continentmin) ~ NA_character_))

Or more efficiently... You can also use this.

df %>%
  rowwise() %>%
  mutate(continent = max(continentmax, continentmin, na.rm = T))

data.table solution

library(data.table)
df.dt = as.data.table(df)
df.dt2 = df.dt[,continent:=pmin(continentmax, continentmin, na.rm = T)]
head(df.dt2)
  • Related