I am trying to reshape a data frame, and the task is just different enough from a typical reshape that I can't quite swing it. The table is originally in excel, but I've already imported to R, so I can work with a solution in either software.
I have a data frame donors
containing names, values, and months:
NAME AMT MONTH
brett 100 jan
brett 100 mar
brett 100 nov
brett 50 apr
brett 50 jun
jane 75 dec
sam 200 jan
sam 200 feb
sam 200 mar
I need to reshape this data frame so that there is a NAME column, an AMT column, and a column for each month. Each row corresponds to any observation that is unique in both name and amount, and the month columns are binary variables (1 if that donor/amount combo occurred that month, and 0 if not.
So the above example would work out to:
NAME AMT JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
brett 100 1 0 1 0 0 0 0 0 0 0 1 0
brett 50 0 0 0 1 0 1 0 0 0 0 0 0
jane 75 0 0 0 0 0 0 0 0 0 0 0 1
sam 200 1 1 1 0 0 0 0 0 0 0 0 0
if you want to complete my code, create a final column equal to the total number of times that NAME/AMT combo occurred (just equal to the sum of the month columns)... but I can do that myself if you aren't feeling it.
Thank you!
CodePudding user response:
Here's a tidyverse
attempt -
complete
to create rows with missing months.arrange
so that you have columns in correct order.- Get data in wide format using
pivot_wider
.
library(dplyr)
library(tidyr)
df %>%
complete(MONTH = tolower(month.abb)) %>%
arrange(match(MONTH, tolower(month.abb))) %>%
pivot_wider(names_from = MONTH, values_from = MONTH,
values_fn = length, values_fill = 0) %>%
na.omit
# NAME AMT jan feb mar apr may jun jul aug sep oct nov dec
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 brett 100 1 0 1 0 0 0 0 0 0 0 1 0
#2 sam 200 1 1 1 0 0 0 0 0 0 0 0 0
#3 brett 50 0 0 0 1 0 1 0 0 0 0 0 0
#4 jane 75 0 0 0 0 0 0 0 0 0 0 0 1
Note that month.abb
is inbuilt R vector which gives abbreviated month names. I used tolower(month.abb)
to match it with the actual column names in the data.
data
df <- structure(list(NAME = c("brett", "brett", "brett", "brett", "brett",
"jane", "sam", "sam", "sam"), AMT = c(100L, 100L, 100L, 50L,
50L, 75L, 200L, 200L, 200L), MONTH = c("jan", "mar", "nov", "apr",
"jun", "dec", "jan", "feb", "mar")),
class = "data.frame", row.names = c(NA, -9L))
CodePudding user response:
Here's my approach with dcast :
library(reshape2)
library(Hmisc)
my_df <- data.frame("NAME" = c(rep("brett", 5), "jane", rep("sam", 3)),
"AMT" = c(rep(100, 3), rep(50, 2), 75, rep(200, 3)),
"MONTH" = c("jan", "mar", "nov", "apr", "jun", "dec", "jan", "feb", "mar"),
stringsAsFactors = FALSE)
my_df <- dcast(my_df, NAME AMT ~ MONTH, value.var = "MONTH", fun.aggregate = length)
my_df <- my_df[order(my_df$NAME, -my_df$AMT),]
my_miss <- setdiff(month.abb, capitalize(colnames(my_df)))
my_df[tolower(my_miss)] <- 0
my_df <- my_df[, c("NAME", "AMT", tolower(month.abb))]
CodePudding user response:
A base R approach.
First aggregate
the data to get the grouping
dat_agg <- aggregate( . ~ AMT ,dat, unique )
dat_agg
AMT NAME MONTH
1 50 brett apr, jun
2 75 jane dec
3 100 brett jan, mar, nov
4 200 sam jan, feb, mar
Then add the desired table of months
dat_agg <- cbind( dat_agg, setNames( data.frame(
matrix( 0, nrow(dat_agg), length(month.abb) ) ), tolower(month.abb) ) )
dat_agg
AMT NAME MONTH jan feb mar apr may jun jul aug sep oct nov dec
1 50 brett apr, jun 0 0 0 0 0 0 0 0 0 0 0 0
2 75 jane dec 0 0 0 0 0 0 0 0 0 0 0 0
3 100 brett jan, mar, nov 0 0 0 0 0 0 0 0 0 0 0 0
4 200 sam jan, feb, mar 0 0 0 0 0 0 0 0 0 0 0 0
Finally populate the binary data by matching the colnames with the list of MONTH
dat_agg[,-c(1:3)][ t( sapply( seq_along(dat_agg$MONTH), function(x)
colnames(dat_agg[,-c(1:3)]) %in% dat_agg$MONTH[[x]] ) ) ] <- 1
dat_agg[,-3]
AMT NAME jan feb mar apr may jun jul aug sep oct nov dec
1 50 brett 0 0 0 1 0 1 0 0 0 0 0 0
2 75 jane 0 0 0 0 0 0 0 0 0 0 0 1
3 100 brett 1 0 1 0 0 0 0 0 0 0 1 0
4 200 sam 1 1 1 0 0 0 0 0 0 0 0 0
Data
dat <- structure(list(NAME = c("brett", "brett", "brett", "brett", "brett",
"jane", "sam", "sam", "sam"), AMT = c(100L, 100L, 100L, 50L,
50L, 75L, 200L, 200L, 200L), MONTH = c("jan", "mar", "nov", "apr",
"jun", "dec", "jan", "feb", "mar")), class = "data.frame", row.names = c(NA,
-9L))
CodePudding user response:
Same solution as provided above, just using data.table
donors <- structure(list(NAME = c("brett", "brett", "brett", "brett", "brett", "jane", "sam", "sam", "sam"), AMT = c(100L, 100L, 100L, 50L, 50L, 75L, 200L, 200L, 200L), MONTH = c("jan", "mar", "nov", "apr",
"jun", "dec", "jan", "feb", "mar")), row.names = c(NA, -9L), class = c("data.frame"))
library(data.table)
setDT(donors) # make a data.table
m <- tolower(month.abb)
donors <- dcast(donors, NAME AMT ~ MONTH, fun.aggregate = length)
donors[, setdiff(m, names(donors)) := 0L]
setcolorder(donors, c(c("NAME", "AMT"), m))
donors[, total := rowSums(.SD), .SDcols = m]
setnames(donors, old = m, new = toupper(m))
donors
NAME AMT JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC total
1: brett 50 0 0 0 1 0 1 0 0 0 0 0 0 2
2: brett 100 1 0 1 0 0 0 0 0 0 0 1 0 3
3: jane 75 0 0 0 0 0 0 0 0 0 0 0 1 1
4: sam 200 1 1 1 0 0 0 0 0 0 0 0 0 3