Combine / sum rows in R-CodePudding

I have a large dataset with each row representing a different category. However, some of categories have been miss spelled or miss named resulting in two rows for the same category. I would like to combine and sum specific rows.

category	2021	2022	2023
Grain	890	900	978
Dried Fruits and Veg	45	55	58
Dried Fruits & Veg	66	74	88

Expected output

category	2021	2022	2023
Grain	890	900	978
Dried Fruits and Veg	111	62	146

I've been trying something like the strategy below but haven't been able to work it out.

DF %>% 
  mutate_at(c(2021:2023), cumsum(starts_with("Dried Fruits")))

Data:

df <- data.frame(
  stringsAsFactors = FALSE,
              category = c("Grain",
                       "Dried Fruits and Veg","Dried Fruits & Veg"),
              "2021" = c(890L, 45L, 66L),
              "2020" = c(900L, 55L, 74L),
              "2023" = c(978L, 58L, 88L)
)

CodePudding user response：

Here's a suggested generalized workflow to deal with typos. It's pretty simple, based on number of characters off, so it won't manage data sets with subtly different categories. (For instance if you have real values "Food" and "Foot", those are only 1 character apart, so this wouldn't distinguish between those real values and a wrong value of "Fooz.")

Here, I first count the times each category value appears. I will assume here that correct values appear more than wrong values.

library(dplyr)
df_counts <- df %>%
  count(category)

Now I look for pairs of categories where the values are unequal, but "not far" (here I arbitrarily used 5 character replacements as the max), and noted the more frequent one:

replacements <- fuzzyjoin::stringdist_left_join(df_counts, df_counts, by = "category", 
                                max_dist = 5, distance_col = "dist") %>%
  filter(n.x > n.y) %>%
  select(category = category.y, category_new = category.x)

Finally, we can replace the typos with their more frequent correct (I assume) version:

df %>%
  left_join(replacements) %>%
  mutate(category = coalesce(category_new, category))

In my example data, it replaces "Driedd Fruits and Veg" with "Dried Fruits & Veg".

Joining, by = "category"
            category X2021 X2020 X2023       category_new
1              Grain   890   900   978               <NA>
2 Dried Fruits & Veg    45    55    58 Dried Fruits & Veg
3 Dried Fruits & Veg    66    74    88               <NA>
4 Dried Fruits & Veg    21    22    23               <NA>

Depending on your data, it might make sense to run a unifying step (like replacing "&" with "and") first on your data before any of these steps, to bring the typo categories closer to their correct counterparts, so that you can use a more picky join distance to avoid false matches.

My fake data for demonstration:

df <- data.frame(
  stringsAsFactors = FALSE,
  category = c("Grain",
               "Driedd Fruits and Veg","Dried Fruits & Veg", "Dried Fruits & Veg"),
  "2021" = c(890L, 45L, 66L, 21L),
  "2020" = c(900L, 55L, 74L, 22L),
  "2023" = c(978L, 58L, 88L, 23L)
)

CodePudding user response：

This solution will possibly work just for this given scenario because I don't know what other types of misspelling and misnaming are there in the category column, but once you clean the category column to have all unique categories, you can simply do a group_by by category column and then summarise over the rest of the columns, like the following,

library(dplyr)
library(stringr)

df <- data.frame(
  stringsAsFactors = FALSE,
              category = c("Grain",
                       "Dried Fruits and Veg","Dried Fruits & Veg"),
              "2021" = c(890L, 45L, 66L),
              "2020" = c(900L, 55L, 74L),
              "2023" = c(978L, 58L, 88L)
)

df %>% 
  mutate(
    category = str_replace(category, "&", "and")
  ) %>% 
  group_by(category) %>% 
  summarise(
    across(X2021:X2023, sum)
  )
#> # A tibble: 2 × 4
#>   category             X2021 X2020 X2023
#>   <chr>                <int> <int> <int>
#> 1 Dried Fruits and Veg   111   129   146
#> 2 Grain                  890   900   978

^{Created on 2022-07-11 by the reprex package (v2.0.1)}

Hope this helps.

CodePudding user response：

Using startsWith is dangerous, it could fail if there are e.g. 'Dried Fruits & Nuts'. Defining all the special cases in the aggregation code itself is tedious and makes it rather unreadable. Better get in the habit of using assignment dictionaries, where you define all the "translations" in a preceding step. It doesn't hurt much and is useful in many cases.

a <- read.table(header=TRUE, text='
FROM                    TO
Grain                   Grain
"Dried Fruits and Veg" "Dried Fruits and Veg"
"Dried Fruits & Veg"   "Dried Fruits and Veg"
')

aggregate(. ~ category, transform(dat, category=a[match(category, a$V1), 2]), sum)
#               category X2021 X2022 X2023
# 1 Dried Fruits and Veg   111   129   146
# 2                Grain   890   900   978

Data:

dat <- structure(list(category = c("Grain", "Dried Fruits and Veg", 
"Dried Fruits & Veg"), X2021 = c(890L, 45L, 66L), X2022 = c(900L, 
55L, 74L), X2023 = c(978L, 58L, 88L)), class = "data.frame", row.names = c(NA, 
-3L))

CodePudding user response：

You can try out the following:

Replicating your data:

DF<-data.frame(category=c("Grain","Dried Fruits and Veg","Dried Fruits & Veg"),"2021"=c(890,45,66),"2022"=c(900,55,74),"2023"=c(978,58,88))
colnames(DF)<-c("category",2021,2022,2023)

DF2<-data.frame(category=c("Grain","Dried Fruits and Veg"),"2021"=c(890,sum(DF$`2021`[grep("Dried",DF$category)])),"2022"=c(900,sum(DF$`2022`[grep("Dried",DF$category)])),"2023"=c(978,sum(DF$`2023`[grep("Dried",DF$category)])))

colnames(DF2)<-c("category",2021,2022,2023)

Using "grep", I first find out the row position which contains the target word and then we can subset the row variables and calculate their sum accordingly.