I have a large dataset with each row representing a different category. However, some of categories have been miss spelled or miss named resulting in two rows for the same category. I would like to combine and sum specific rows.
DF
category | 2021 | 2022 | 2023 |
---|---|---|---|
Grain | 890 | 900 | 978 |
Dried Fruits and Veg | 45 | 55 | 58 |
Dried Fruits & Veg | 66 | 74 | 88 |
Expected output
category | 2021 | 2022 | 2023 |
---|---|---|---|
Grain | 890 | 900 | 978 |
Dried Fruits and Veg | 111 | 62 | 146 |
I've been trying something like the strategy below but haven't been able to work it out.
DF %>%
mutate_at(c(2021:2023), cumsum(starts_with("Dried Fruits")))
Data:
df <- data.frame(
stringsAsFactors = FALSE,
category = c("Grain",
"Dried Fruits and Veg","Dried Fruits & Veg"),
"2021" = c(890L, 45L, 66L),
"2020" = c(900L, 55L, 74L),
"2023" = c(978L, 58L, 88L)
)
CodePudding user response:
Here's a suggested generalized workflow to deal with typos. It's pretty simple, based on number of characters off, so it won't manage data sets with subtly different categories. (For instance if you have real values "Food" and "Foot", those are only 1 character apart, so this wouldn't distinguish between those real values and a wrong value of "Fooz.")
Here, I first count the times each category value appears. I will assume here that correct values appear more than wrong values.
library(dplyr)
df_counts <- df %>%
count(category)
Now I look for pairs of categories where the values are unequal, but "not far" (here I arbitrarily used 5 character replacements as the max), and noted the more frequent one:
replacements <- fuzzyjoin::stringdist_left_join(df_counts, df_counts, by = "category",
max_dist = 5, distance_col = "dist") %>%
filter(n.x > n.y) %>%
select(category = category.y, category_new = category.x)
Finally, we can replace the typos with their more frequent correct (I assume) version:
df %>%
left_join(replacements) %>%
mutate(category = coalesce(category_new, category))
In my example data, it replaces "Driedd Fruits and Veg" with "Dried Fruits & Veg".
Joining, by = "category"
category X2021 X2020 X2023 category_new
1 Grain 890 900 978 <NA>
2 Dried Fruits & Veg 45 55 58 Dried Fruits & Veg
3 Dried Fruits & Veg 66 74 88 <NA>
4 Dried Fruits & Veg 21 22 23 <NA>
Depending on your data, it might make sense to run a unifying step (like replacing "&" with "and") first on your data before any of these steps, to bring the typo categories closer to their correct counterparts, so that you can use a more picky join distance to avoid false matches.
My fake data for demonstration:
df <- data.frame(
stringsAsFactors = FALSE,
category = c("Grain",
"Driedd Fruits and Veg","Dried Fruits & Veg", "Dried Fruits & Veg"),
"2021" = c(890L, 45L, 66L, 21L),
"2020" = c(900L, 55L, 74L, 22L),
"2023" = c(978L, 58L, 88L, 23L)
)
CodePudding user response:
This solution will possibly work just for this given scenario because I don't know what other types of misspelling and misnaming are there in the category column, but once you clean the category column to have all unique categories, you can simply do a group_by
by category column and then summarise
over the rest of the columns, like the following,
library(dplyr)
library(stringr)
df <- data.frame(
stringsAsFactors = FALSE,
category = c("Grain",
"Dried Fruits and Veg","Dried Fruits & Veg"),
"2021" = c(890L, 45L, 66L),
"2020" = c(900L, 55L, 74L),
"2023" = c(978L, 58L, 88L)
)
df %>%
mutate(
category = str_replace(category, "&", "and")
) %>%
group_by(category) %>%
summarise(
across(X2021:X2023, sum)
)
#> # A tibble: 2 × 4
#> category X2021 X2020 X2023
#> <chr> <int> <int> <int>
#> 1 Dried Fruits and Veg 111 129 146
#> 2 Grain 890 900 978
Created on 2022-07-11 by the reprex package (v2.0.1)
Hope this helps.
CodePudding user response:
Using startsWith
is dangerous, it could fail if there are e.g. 'Dried Fruits & Nuts'
. Defining all the special cases in the aggregation code itself is tedious and makes it rather unreadable. Better get in the habit of using assignment dictionaries, where you define all the "translations" in a preceding step. It doesn't hurt much and is useful in many cases.
a <- read.table(header=TRUE, text='
FROM TO
Grain Grain
"Dried Fruits and Veg" "Dried Fruits and Veg"
"Dried Fruits & Veg" "Dried Fruits and Veg"
')
aggregate(. ~ category, transform(dat, category=a[match(category, a$V1), 2]), sum)
# category X2021 X2022 X2023
# 1 Dried Fruits and Veg 111 129 146
# 2 Grain 890 900 978
Data:
dat <- structure(list(category = c("Grain", "Dried Fruits and Veg",
"Dried Fruits & Veg"), X2021 = c(890L, 45L, 66L), X2022 = c(900L,
55L, 74L), X2023 = c(978L, 58L, 88L)), class = "data.frame", row.names = c(NA,
-3L))
CodePudding user response:
You can try out the following:
Replicating your data:
DF<-data.frame(category=c("Grain","Dried Fruits and Veg","Dried Fruits & Veg"),"2021"=c(890,45,66),"2022"=c(900,55,74),"2023"=c(978,58,88))
colnames(DF)<-c("category",2021,2022,2023)
DF2<-data.frame(category=c("Grain","Dried Fruits and Veg"),"2021"=c(890,sum(DF$`2021`[grep("Dried",DF$category)])),"2022"=c(900,sum(DF$`2022`[grep("Dried",DF$category)])),"2023"=c(978,sum(DF$`2023`[grep("Dried",DF$category)])))
colnames(DF2)<-c("category",2021,2022,2023)
Using "grep", I first find out the row position which contains the target word and then we can subset the row variables and calculate their sum accordingly.