How do I execute this unique reshaping in R (or excel)?-CodePudding

I am trying to reshape a data frame, and the task is just different enough from a typical reshape that I can't quite swing it. The table is originally in excel, but I've already imported to R, so I can work with a solution in either software.

I have a data frame donors containing names, values, and months:

NAME    AMT    MONTH
brett   100    jan
brett   100    mar
brett   100    nov
brett   50     apr
brett   50     jun
jane    75     dec
sam     200    jan
sam     200    feb
sam     200    mar

I need to reshape this data frame so that there is a NAME column, an AMT column, and a column for each month. Each row corresponds to any observation that is unique in both name and amount, and the month columns are binary variables (1 if that donor/amount combo occurred that month, and 0 if not.

So the above example would work out to:

NAME   AMT   JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
brett  100   1   0   1   0   0   0   0   0   0   0   1   0
brett  50    0   0   0   1   0   1   0   0   0   0   0   0
jane   75    0   0   0   0   0   0   0   0   0   0   0   1
sam    200   1   1   1   0   0   0   0   0   0   0   0   0

if you want to complete my code, create a final column equal to the total number of times that NAME/AMT combo occurred (just equal to the sum of the month columns)... but I can do that myself if you aren't feeling it.

Thank you!

CodePudding user response：

Here's a tidyverse attempt -

complete to create rows with missing months.
arrange so that you have columns in correct order.
Get data in wide format using pivot_wider.

library(dplyr)
library(tidyr)

df %>%
  complete(MONTH = tolower(month.abb)) %>%
  arrange(match(MONTH, tolower(month.abb))) %>%
  pivot_wider(names_from = MONTH, values_from = MONTH, 
              values_fn = length, values_fill = 0) %>%
  na.omit

#  NAME    AMT   jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
#  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 brett   100     1     0     1     0     0     0     0     0     0     0     1     0
#2 sam     200     1     1     1     0     0     0     0     0     0     0     0     0
#3 brett    50     0     0     0     1     0     1     0     0     0     0     0     0
#4 jane     75     0     0     0     0     0     0     0     0     0     0     0     1

Note that month.abb is inbuilt R vector which gives abbreviated month names. I used tolower(month.abb) to match it with the actual column names in the data.

data

df <- structure(list(NAME = c("brett", "brett", "brett", "brett", "brett", 
"jane", "sam", "sam", "sam"), AMT = c(100L, 100L, 100L, 50L, 
50L, 75L, 200L, 200L, 200L), MONTH = c("jan", "mar", "nov", "apr", 
"jun", "dec", "jan", "feb", "mar")), 
class = "data.frame", row.names = c(NA, -9L))

CodePudding user response：

Here's my approach with dcast :

library(reshape2)
library(Hmisc)

my_df <- data.frame("NAME" = c(rep("brett", 5), "jane", rep("sam", 3)),
                "AMT" = c(rep(100, 3), rep(50, 2), 75, rep(200, 3)),
                "MONTH" = c("jan", "mar", "nov", "apr", "jun", "dec", "jan", "feb", "mar"),
                stringsAsFactors = FALSE)

my_df <- dcast(my_df, NAME AMT ~ MONTH, value.var = "MONTH", fun.aggregate = length)
my_df <- my_df[order(my_df$NAME, -my_df$AMT),]
my_miss <- setdiff(month.abb, capitalize(colnames(my_df)))
my_df[tolower(my_miss)] <- 0
my_df <- my_df[, c("NAME", "AMT", tolower(month.abb))]

CodePudding user response：

A base R approach.
First aggregate the data to get the grouping

dat_agg <- aggregate( . ~ AMT ,dat, unique )

dat_agg
  AMT  NAME         MONTH
1  50 brett      apr, jun
2  75  jane           dec
3 100 brett jan, mar, nov
4 200   sam jan, feb, mar

Then add the desired table of months

dat_agg <- cbind( dat_agg, setNames( data.frame( 
  matrix( 0, nrow(dat_agg), length(month.abb) ) ), tolower(month.abb) ) )

dat_agg
  AMT  NAME         MONTH jan feb mar apr may jun jul aug sep oct nov dec
1  50 brett      apr, jun   0   0   0   0   0   0   0   0   0   0   0   0
2  75  jane           dec   0   0   0   0   0   0   0   0   0   0   0   0
3 100 brett jan, mar, nov   0   0   0   0   0   0   0   0   0   0   0   0
4 200   sam jan, feb, mar   0   0   0   0   0   0   0   0   0   0   0   0

Finally populate the binary data by matching the colnames with the list of MONTH

dat_agg[,-c(1:3)][ t( sapply( seq_along(dat_agg$MONTH), function(x)
  colnames(dat_agg[,-c(1:3)]) %in% dat_agg$MONTH[[x]] ) ) ] <- 1

dat_agg[,-3]
  AMT  NAME jan feb mar apr may jun jul aug sep oct nov dec
1  50 brett   0   0   0   1   0   1   0   0   0   0   0   0
2  75  jane   0   0   0   0   0   0   0   0   0   0   0   1
3 100 brett   1   0   1   0   0   0   0   0   0   0   1   0
4 200   sam   1   1   1   0   0   0   0   0   0   0   0   0

Data

dat <- structure(list(NAME = c("brett", "brett", "brett", "brett", "brett", 
"jane", "sam", "sam", "sam"), AMT = c(100L, 100L, 100L, 50L, 
50L, 75L, 200L, 200L, 200L), MONTH = c("jan", "mar", "nov", "apr", 
"jun", "dec", "jan", "feb", "mar")), class = "data.frame", row.names = c(NA, 
-9L))

CodePudding user response：

Same solution as provided above, just using data.table

donors <- structure(list(NAME = c("brett", "brett", "brett", "brett", "brett", "jane", "sam", "sam", "sam"), AMT = c(100L, 100L, 100L, 50L, 50L, 75L, 200L, 200L, 200L), MONTH = c("jan", "mar", "nov", "apr", 
"jun", "dec", "jan", "feb", "mar")), row.names = c(NA, -9L), class = c("data.frame"))

library(data.table)
setDT(donors) # make a data.table

m <- tolower(month.abb)
donors <- dcast(donors, NAME   AMT ~ MONTH, fun.aggregate = length)
donors[, setdiff(m, names(donors)) := 0L]
setcolorder(donors, c(c("NAME", "AMT"), m))
donors[, total := rowSums(.SD), .SDcols = m]
setnames(donors, old = m, new = toupper(m))

donors

    NAME AMT JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC total
1: brett  50   0   0   0   1   0   1   0   0   0   0   0   0     2
2: brett 100   1   0   1   0   0   0   0   0   0   0   1   0     3
3:  jane  75   0   0   0   0   0   0   0   0   0   0   0   1     1
4:   sam 200   1   1   1   0   0   0   0   0   0   0   0   0     3