Collapsing Dataframe Rows along several variables

Time:03-30

I have a dataframe that looks something like this, in which I have several rows for each user, and many NAs in the columns.

user Effect T1 Effect T2 Effect T3 Benchmark T1 Benchmark T2 Benchmark T3
Tom 01 NA NA 02 NA NA
Tom NA 07 NA NA 08 NA
Tom NA NA 13 NA NA 14
Larry 03 NA NA 04 NA NA
Larry NA 09 NA NA 10 NA
Larry NA NA 15 NA NA 16
Dave 05 NA NA 06 NA NA
Dave NA 11 NA NA 12 NA
Dave NA NA 17 NA NA 18

I want to collapse the rows by user name, filling in each column with the non-NA value from that user's rows, like this.

user Effect T1 Effect T2 Effect T3 Benchmark T1 Benchmark T2 Benchmark T3
Tom 01 07 13 02 08 14
Larry 03 09 15 04 10 16
Dave 05 11 17 06 12 18

How might I accomplish this?

Thank you in advance for your help. Update: I've added the dput of a subset of the actual data below.

structure(list(name = c("Abraham_Ralph", "Abraham_Ralph", "Abraham_Ralph", 
"Ackerman_Gary", "Adams_Alma", "Adams_Alma", "Adams_Alma", "Adams_Alma", 
"Adams_Sandy", "Aderholt_Robert", "Aderholt_Robert", "Aderholt_Robert", 
"Aderholt_Robert", "Aderholt_Robert", "Aguilar_Pete", "Aguilar_Pete", 
"Aguilar_Pete"), state = c("LA", "LA", "LA", "NY", "NC", "NC", 
"NC", "NC", "FL", "AL", "AL", "AL", "AL", "AL", "CA", "CA", "CA"
), seniority = c(1, 2, 3, 15, 1, 2, 3, 4, 1, 8, 9, 10, 11, 12, 
1, 2, 3), legeffect_112 = c(NA, NA, NA, 0.202061712741852, NA, 
NA, NA, NA, 1.30758035182953, 3.73544979095459, NA, NA, NA, NA, 
NA, NA, NA), legeffect_113 = c(NA, NA, NA, NA, 0, NA, NA, NA, 
NA, NA, 0.908495426177979, NA, NA, NA, NA, NA, NA), legeffect_114 = c(2.07501077651978, 
NA, NA, NA, NA, 0.84164834022522, NA, NA, NA, NA, NA, 0.340001106262207, 
NA, NA, 0.10985741019249, NA, NA), legeffect_115 = c(NA, 0.493490308523178, 
NA, NA, NA, NA, 0.587624311447144, NA, NA, NA, NA, NA, 0.159877583384514, 
NA, NA, 0.730929613113403, NA), legeffect_116 = c(NA, NA, 0.0397605448961258, 
NA, NA, NA, NA, 1.78378939628601, NA, NA, NA, NA, NA, 0.0198802724480629, 
NA, NA, 0.0497006773948669), benchmark_112 = c(NA, NA, NA, 0.738679468631744, 
NA, NA, NA, NA, 0.82908970117569, 1.39835929870605, NA, NA, NA, 
NA, NA, NA, NA), benchmark_113 = c(NA, NA, NA, NA, 0.391001850366592, 
NA, NA, NA, NA, NA, 1.58223271369934, NA, NA, NA, NA, NA, NA), 
    benchmark_114 = c(1.40446054935455, NA, NA, NA, NA, 0.576326191425323, 
    NA, NA, NA, NA, NA, 1.42212760448456, NA, NA, 0.574363172054291, 
    NA, NA), benchmark_115 = c(NA, 1.3291300535202, NA, NA, NA, 
    NA, 0.537361204624176, NA, NA, NA, NA, NA, 1.45703768730164, 
    NA, NA, 0.523149251937866, NA), benchmark_116 = c(NA, NA, 
    0.483340591192245, NA, NA, NA, NA, 1.31058621406555, NA, 
    NA, NA, NA, NA, 0.751261711120605, NA, NA, 1.05683290958405
    )), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"
))

CodePudding user response:

A data.table solution:

# melt data, remove NA, then recast ...
dt <- dcast(melt(d, "user")[!value %in% NA], user ~ variable)

dt
#      user Effect T1 Effect T2 Effect T3 Benchmark T1 Benchmark T2 Benchmark T3
# 1:   Dave         5        11        17            6           12           18
# 2:  Larry         3         9        15            4           10           16
# 3:    Tom         1         7        13            2            8           14

Data/Setup

# Load data.table
# install.packages("data.table")
library(data.table)

# Read example data
d <- fread("user,Effect T1,Effect T2,Effect T3,Benchmark T1,Benchmark T2,Benchmark T3
Tom,01,NA,NA,02,NA,NA
Tom,NA,07,NA,NA,08,NA
Tom,NA,NA,13,NA,NA,14
Larry,03,NA,NA,04,NA,NA
Larry,NA,09,NA,NA,10,NA
Larry,NA,NA,15,NA,NA,16
Dave,05,NA,NA,06,NA,NA
Dave,NA,11,NA,NA,12,NA
Dave,NA,NA,17,NA,NA,18")
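
To make the one-liner easier to follow, here is a minimal sketch that runs the same melt, filter, and dcast steps one at a time. The toy table and its snake_case column names are placeholders for illustration, not the asker's data.

```r
library(data.table)

# Toy data; the column names are placeholders for this sketch.
d <- data.table(
    user      = rep(c("Tom", "Larry"), each = 2),
    effect_t1 = c(1, NA, 3, NA),
    effect_t2 = c(NA, 7, NA, 9)
)

# Step 1: wide -> long, one row per (user, variable) pair, NAs included
long <- melt(d, id.vars = "user")

# Step 2: drop the rows whose value is NA
long <- long[!is.na(value)]

# Step 3: long -> wide again, one row per user
wide <- dcast(long, user ~ variable, value.var = "value")
print(wide)
```

This relies on each user having at most one non-NA value per column. If a user had several, dcast would need an explicit fun.aggregate; without one it falls back to counting the values.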

CodePudding user response:

This solution uses only base R functions (no extra packages). Written as a one-liner it would make your eyes cross, so I've split it into several functions.

The plan is the following:

  1. Split the original data.frame by the values in the name column, using the function by;
  2. For each partition of the data.frame, collapse the columns;
  3. A collapsed column returns the max value of the column, or NA if all its values are NA;
  4. The collapsed data.frame partitions are stacked together.

So, this is a function that does that:

dfr_collapse <- function(dfr, col0)
{
    # Collapse the columns of the data.frame "dfr" grouped by the values of
    # the column "col0"

    # Max/NA function
    namax <- function(x)
    {
        if(all(is.na(x)))
            NA   # !!!
        else
            max(x, na.rm=TRUE)
    }

    # Column collapse function
    byfun <- function(x)
    {
        lapply(x, namax)
    }

    # Stack the partitioning results
    return(do.call(
       what = rbind,
       args = by(dfr, dfr[[col0]], byfun)
    ))
}

It may not look as slick as a one-liner, but it does the job. It can be turned into a one-liner, but you don't want that.

Assuming that df0 is the data.frame from your dput, you can test this function with

dfr_collapse(df0, "name")
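
As a quick check of the same split-and-collapse idea on toy data shaped like the question's example, here is a minimal sketch. It assumes snake_case column names (placeholders), and it uses split with lapply in place of by, converting each collapsed group to a one-row data.frame so that rbind returns an ordinary data.frame.

```r
# Toy data modelled on the question; the column names are placeholders.
toy <- data.frame(
    user         = rep(c("Tom", "Larry", "Dave"), each = 2),
    effect_t1    = c(1, NA, 3, NA, 5, NA),
    effect_t2    = c(NA, 7, NA, 9, NA, 11),
    benchmark_t1 = c(2, NA, 4, NA, 6, NA),
    benchmark_t2 = c(NA, 8, NA, 10, NA, 12),
    stringsAsFactors = FALSE
)

# Max of the non-NA values, or NA when every value is NA
namax <- function(x) {
    if (all(is.na(x))) NA else max(x, na.rm = TRUE)
}

# Collapse one group into a one-row data.frame
collapse_group <- function(part) {
    as.data.frame(lapply(part, namax), stringsAsFactors = FALSE)
}

# Split by user, collapse each group, then stack the rows
collapsed <- do.call(rbind, lapply(split(toy, toy$user), collapse_group))
print(collapsed)
```

The rows come back sorted by group name (Dave, Larry, Tom), because split orders the groups alphabetically rather than by first appearance.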

Nota bene: for the sake of simplicity, I return an NA of type logical (see the comment # !!! above). The correct code should convert that NA to the mode of the x vector. Also, the function should check the type of its inputs, etc.
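
One possible way to keep the type in the all-NA case is sketched below. The helper name namax_typed is hypothetical. It relies on the fact that indexing a vector with NA_integer_ yields a single NA of that vector's own type.

```r
# Hypothetical helper: like namax, but the all-NA case keeps the type of x.
namax_typed <- function(x) {
    if (all(is.na(x))) {
        x[NA_integer_]   # length-one NA with the same type as x
    } else {
        max(x, na.rm = TRUE)
    }
}

typeof(namax_typed(c(NA_real_, NA_real_)))            # "double"
typeof(namax_typed(c(NA_character_, NA_character_)))  # "character"
namax_typed(c(NA, 0.5, 2))                            # 2
```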
