I have a dataframe that looks something like this, in which I have several rows for each user, and many NAs in the columns.
user | Effect T1 | Effect T2 | Effect T3 | Benchmark T1 | Benchmark T2 | Benchmark T3 |
---|---|---|---|---|---|---|
Tom | 01 | NA | NA | 02 | NA | NA |
Tom | NA | 07 | NA | NA | 08 | NA |
Tom | NA | NA | 13 | NA | NA | 14 |
Larry | 03 | NA | NA | 04 | NA | NA |
Larry | NA | 09 | NA | NA | 10 | NA |
Larry | NA | NA | 15 | NA | NA | 16 |
Dave | 05 | NA | NA | 06 | NA | NA |
Dave | NA | 11 | NA | NA | 12 | NA |
Dave | NA | NA | 17 | NA | NA | 18 |
I want to collapse the rows by user name, filling each column with the non-NA value from each row, like this.
user | Effect T1 | Effect T2 | Effect T3 | Benchmark T1 | Benchmark T2 | Benchmark T3 |
---|---|---|---|---|---|---|
Tom | 01 | 07 | 13 | 02 | 08 | 14 |
Larry | 03 | 09 | 15 | 04 | 10 | 16 |
Dave | 05 | 11 | 17 | 06 | 12 | 18 |
How might I accomplish this?
Thank you in advance for your help. Update: I've added the dput of a subset of the actual data below.
structure(list(name = c("Abraham_Ralph", "Abraham_Ralph", "Abraham_Ralph",
"Ackerman_Gary", "Adams_Alma", "Adams_Alma", "Adams_Alma", "Adams_Alma",
"Adams_Sandy", "Aderholt_Robert", "Aderholt_Robert", "Aderholt_Robert",
"Aderholt_Robert", "Aderholt_Robert", "Aguilar_Pete", "Aguilar_Pete",
"Aguilar_Pete"), state = c("LA", "LA", "LA", "NY", "NC", "NC",
"NC", "NC", "FL", "AL", "AL", "AL", "AL", "AL", "CA", "CA", "CA"
), seniority = c(1, 2, 3, 15, 1, 2, 3, 4, 1, 8, 9, 10, 11, 12,
1, 2, 3), legeffect_112 = c(NA, NA, NA, 0.202061712741852, NA,
NA, NA, NA, 1.30758035182953, 3.73544979095459, NA, NA, NA, NA,
NA, NA, NA), legeffect_113 = c(NA, NA, NA, NA, 0, NA, NA, NA,
NA, NA, 0.908495426177979, NA, NA, NA, NA, NA, NA), legeffect_114 = c(2.07501077651978,
NA, NA, NA, NA, 0.84164834022522, NA, NA, NA, NA, NA, 0.340001106262207,
NA, NA, 0.10985741019249, NA, NA), legeffect_115 = c(NA, 0.493490308523178,
NA, NA, NA, NA, 0.587624311447144, NA, NA, NA, NA, NA, 0.159877583384514,
NA, NA, 0.730929613113403, NA), legeffect_116 = c(NA, NA, 0.0397605448961258,
NA, NA, NA, NA, 1.78378939628601, NA, NA, NA, NA, NA, 0.0198802724480629,
NA, NA, 0.0497006773948669), benchmark_112 = c(NA, NA, NA, 0.738679468631744,
NA, NA, NA, NA, 0.82908970117569, 1.39835929870605, NA, NA, NA,
NA, NA, NA, NA), benchmark_113 = c(NA, NA, NA, NA, 0.391001850366592,
NA, NA, NA, NA, NA, 1.58223271369934, NA, NA, NA, NA, NA, NA),
benchmark_114 = c(1.40446054935455, NA, NA, NA, NA, 0.576326191425323,
NA, NA, NA, NA, NA, 1.42212760448456, NA, NA, 0.574363172054291,
NA, NA), benchmark_115 = c(NA, 1.3291300535202, NA, NA, NA,
NA, 0.537361204624176, NA, NA, NA, NA, NA, 1.45703768730164,
NA, NA, 0.523149251937866, NA), benchmark_116 = c(NA, NA,
0.483340591192245, NA, NA, NA, NA, 1.31058621406555, NA,
NA, NA, NA, NA, 0.751261711120605, NA, NA, 1.05683290958405
)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"
))
CodePudding user response:
A data.table solution:
# melt data, remove NA, then recast ...
dt <- dcast(melt(d, "user")[!value %in% NA], user ~ variable)
dt
# user Effect T1 Effect T2 Effect T3 Benchmark T1 Benchmark T2 Benchmark T3
# 1: Dave 5 11 17 6 12 18
# 2: Larry 3 9 15 4 10 16
# 3: Tom 1 7 13 2 8 14
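The one-liner above packs three steps into a single call. As a rough sketch of what happens at each step, here is the same melt / drop NA / dcast sequence written out on a smaller, made-up table (the table contents and the names long and wide are illustrative assumptions, and is.na() is swapped in for %in% NA):

```r
library(data.table)

# Small hypothetical table in the same shape as the question's example
d <- data.table(
    user = c("Tom", "Tom", "Larry", "Larry"),
    `Effect T1` = c(1L, NA, 3L, NA),
    `Effect T2` = c(NA, 7L, NA, 9L)
)

# Step 1: wide -> long, one row per (user, variable) pair
long <- melt(d, id.vars = "user")

# Step 2: drop the rows whose value is NA
long <- long[!is.na(value)]

# Step 3: long -> wide again, one row per user
wide <- dcast(long, user ~ variable, value.var = "value")
wide
```

This relies on each user having at most one non-NA value per column. If a user had several, dcast would need an explicit fun.aggregate to decide how to combine them.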
Data/Setup
# Load data.table
# install.packages("data.table")
library(data.table)
# Read example data
d <- fread("user,Effect T1,Effect T2,Effect T3,Benchmark T1,Benchmark T2,Benchmark T3
Tom,01,NA,NA,02,NA,NA
Tom,NA,07,NA,NA,08,NA
Tom,NA,NA,13,NA,NA,14
Larry,03,NA,NA,04,NA,NA
Larry,NA,09,NA,NA,10,NA
Larry,NA,NA,15,NA,NA,16
Dave,05,NA,NA,06,NA,NA
Dave,NA,11,NA,NA,12,NA
Dave,NA,NA,17,NA,NA,18")
CodePudding user response:
This solution uses only base R functions (no extra packages). A one-liner would make eyes cross, so I'll split it into several functions.
The plan is the following:
- Split the original data.frame by the values in the name column, using the function by;
- For each partition of the data.frame, collapse the columns;
- A collapsed column returns the max value of the column, or NA if all its values are NA;
- The collapsed data.frame partitions are stacked together.
So, this is a function that does that:
dfr_collapse <- function(dfr, col0)
{
    # Collapse the columns of the data.frame "dfr" grouped by the values of
    # the column "col0"

    # Max/NA function
    namax <- function(x)
    {
        if(all(is.na(x)))
            NA # !!!
        else
            max(x, na.rm=TRUE)
    }

    # Column collapse function
    byfun <- function(x)
    {
        lapply(x, namax)
    }

    # Stack the partitioning results
    return(do.call(
        what = rbind,
        args = by(dfr, dfr[[col0]], byfun)
    ))
}
It may not look as slick as a one-liner, but it does the job. It can be turned into a one-liner, but you don't want that.
Assuming that df0 is the data.frame from your dput, you can test this function with
dfr_collapse(df0, "name")
Nota bene: for the sake of simplicity, I return an NA of type logical (see the comment # !!! above). The correct code should convert that NA to the mode of the x vector. Also, the function should check the type of its inputs, etc.