Sum rows with similar strings-CodePudding

I have a df looks like this (Since several questions of mine were not clear then I changed to this way)

df <- data.frame(name = c("Acer laurinum", "Acer laurinum Hassk.", "Acmella paniculata", 
                          "Adinandra cf. integerrima", "Adinandra cf. integerrima T.Anderson"),
                 value1 = c(1,2,3,4,5),
                 value2 = c(2,3,4,5,6))

                         name            value1 value2
1                        Acer laurinum      1      2
2                 Acer laurinum Hassk.      2      3
3                   Acmella paniculata      3      4
4            Adinandra cf. integerrima      4      5
5 Adinandra cf. integerrima T.Anderson      5      6

I would like to sum/aggregate rows based on the name column with the condition: If strings of two rows are similar, then they must be summed. And importantly, the name resulted by the aggregation would be the name with Author part. In the real data, the names located in different rows, then I just made an example here to show my expectation. Anyone help me,please?

Desired output

 name value1 value2

1                 Acer laurinum Hassk.      3      5
2                   Acmella paniculata      3      4
3 Adinandra cf. integerrima T.Anderson      9      11

CodePudding user response：

You can do something like the following:

LongestSuperstring <- function(x) {
  cx <- as.character(x)  # Convert to character in case of factors.
  xx <- unique(cx)  # Get only unique values in case of dupes.
  sx <- xx[rank(nchar(xx))] # Sort them by length ascending.
  matches <- outer(sx, sx, Vectorize(grepl), fixed = TRUE)  # Get substring matches.
  # Get the index of longest superstring or itself.
  sidx <- apply(apply(matches, 1, cumsum), 2, which.max) 
  sxv <- setNames(sx[sidx], sx) # Lookup from sx to to the longest superstring.
  out <- sxv[cx]  # Look up the superstring for every x
  # Convert back to factor if needed
  if (is.factor(x)) {
    out <- factor(out, levels = levels(x))
  }
  return(out)
}

df %>%
  group_by(name = LongestSuperstring(name)) %>%
  summarize(across(c(value1, value2), sum))

CodePudding user response：

Here's a somewhat radical solution based on the assumption that the only differentiating part between the paired strings is the addition of the author. If that assumption is correct then this might work:

df %>%
  mutate(name = trimws(str_remove(name, "(?<=\\s)(?=.*\\.)[.\\w] $"))) %>%
  group_by(name) %>%
  summarise(across(c(value1, value2), sum))
# A tibble: 3 x 3
  name                      value1 value2
* <chr>                      <dbl>  <dbl>
1 Acer laurinum                  3      5
2 Acmella paniculata             3      4
3 Adinandra cf. integerrima      9     11

How the regex works:

(?<=\\s): asserts a whitespace character to the left
(?=.*\\.): asserts the occurrence of a .anywhere in what follows
[.\\w] $: a character class containing . and alphanumeric characters at the end of the string

EDIT:

if you want to keep your original name column:

df %>%
  mutate(author = str_extract(name, "(?<=\\s)(?=.*\\.)[.\\w] $"),
         name = trimws(str_remove(name, "(?<=\\s)(?=.*\\.)[.\\w] $"))) %>%
  group_by(name) %>%
  fill(author, .direction = "up") %>%
  group_by(name, author) %>%
  summarise(across(c(value1, value2), sum)) %>%
  mutate(name = ifelse(!is.na(author), paste(name, author, sep = " "), name)) %>%
  select(-author)
# A tibble: 3 x 3
# Groups:   name [3]
  name                                 value1 value2
  <chr>                                 <dbl>  <dbl>
1 Acer laurinum Hassk.                      3      5
2 Acmella paniculata                        3      4
3 Adinandra cf. integerrima T.Anderson      9     11

CodePudding user response：

Here's a solution based on defining each row as a dynamic pattern and matching the words that occur across pairs of rows:

# function to automatically *escape* metacharacters:
escape.for.regex <- function(string) {
  gsub("([][{}() *^${|\\\\?.])", "\\\\\\1", string)
}

# stopwords:
stopwords <- c("cf", "and")
stopwords_pattern <- paste0("\\b(", paste0(stopwords, collapse = "|"), ")\\b")

# initialize vectors/columns:
pattern1 <- c()
df$repeats <- NA

# for loop with stopwords to detect repeated words across turns:
for(i in 2:nrow(df)){
  pattern1[i-1] <- paste0("(?!", stopwords_pattern, ")(?<!\\S)(?:", paste0(escape.for.regex(unlist(str_split(trimws(df$name[i-1]), "\\s "))), collapse = "|"), ")(?!\\S)")
  df$repeats[i] <- str_extract_all(df$name[i], pattern1[i-1])
}

library(dplyr)
df %>%
  mutate(repeats = gsub(" ", "", sapply(repeats, toString)),
         author = str_extract(name, "(?<=\\s)(?=.*\\.)[.\\w] $"),
         name = trimws(str_remove(name, "(?<=\\s)(?=.*\\.)[.\\w] $")),
         repeats = ifelse(repeats == "NA"|repeats == "", NA, repeats),
         repeats = ifelse(is.na(repeats), lead(repeats), repeats)) %>%
  group_by(name) %>%
  fill(author, .direction = "up") %>%
  group_by(name, author) %>%
  summarise(across(c(value1, value2), sum)) %>%
  mutate(name = ifelse(!is.na(author), paste(name, author, sep = " "), name)) %>%
  select(-author)
# A tibble: 3 x 3
# Groups:   name [3]
  name                                 value1 value2
  <chr>                                 <dbl>  <dbl>
1 Acer laurinum Hassk.                      3      5
2 Acmella paniculata                        3      4
3 Adinandra cf. integerrima T.Anderson      9     11