Home > Enterprise >  update column based on duplicated rows
update column based on duplicated rows

Time:10-13

I have a df as below

df <- data.frame(col = c("Acer laurinum", "Acer laurinum"),
                 author = c("", "Hassk."),
                 value = c(1,1))

df
                     col    author value1 value2
1          Acer laurinum                1      2
2          Acer laurinum    Hassk.      1      3
3 Acronychia pedunculata (L.) Miq.      0      1
4 Acronychia pedunculata                0      0

I would like to update the author column saying that if I found any duplicated rows in col column, then the blank cell of author associated with the value (i.e. Acer laurinum) will be updated to Hassk. And the value must be retained.

  • Desired output
                     col    author value1 value2
1          Acer laurinum    Hassk.      1      2
2          Acer laurinum    Hassk.      1      3
3 Acronychia pedunculata (L.) Miq.      0      1
4 Acronychia pedunculata (L.) Miq.      0      0

Any suggestions for this?

CodePudding user response:

You should use dplyr::group_by and dplyr::mutate with max:

df %>% group_by(col) %>%
   mutate(author=max(author))

Output:

  col                    author    value1 value2
  <chr>                  <chr>      <dbl>  <dbl>
1 Acer laurinum          Hassk.         1      2
2 Acer laurinum          Hassk.         1      3
3 Acronychia pedunculata (L.) Miq.      0      1
4 Acronychia pedunculata (L.) Miq.      0      0
> 

CodePudding user response:

# Base option 1:
df$author <- with(df, ave(author, col, FUN = max))

# Base option 2:
transform(
  with(
    df,
    replace(df, df == "", NA_character_)[
      order(col, author, decreasing = TRUE),
    ]
  ),
  author = na.omit(author)[cumsum(!(is.na(author)))]
)[row.names(df),]

Data:

df <- structure(list(col = c("Acer laurinum", "Acer laurinum", "Acronychia pedunculata", 
"Acronychia pedunculata"), author = c("", "Hassk.", "(L.) Miq.", 
""), value1 = c(1, 1, 0, 0), value2 = c(2, 3, 1, 0)), class = "data.frame", row.names = c(NA, 
-4L))
  • Related