**UPDATE
My data set contains 314090 observations in the following format:
UPDATEDID | BRIEF_ID | gamma | LDR_SUM | LDR_Topic | LDR_7Code |
---|---|---|---|---|---|
16 | 04999120040277 | 2.879744e-03 | 0.15326902 | supervises collective followers very closely | 1 |
Note: there are duplicate BRIEF_ID numbers (3205 unique #s), LDR_Topics (15 unique LDR topics with corresponding LDR_7Codes) so that is why the data is very long.
I'd like to reshape this data whereby each row is a unique #s (3205 rows) and each LDR_Topic (15) is it's own unique column (20 columns total) with its corresponding LDR_SUM as values within the column. For example:
UPDATEDID | BRIEF_ID | supervises collective followers very closely |
---|---|---|
16 | 04999120040277 | 0.15326902 |
So far I've tried:
BriefingGammas4<-reshape(data = BriefingGammas3,
idvar = c("UPDATEDID", "BRIEF_ID"),
timevar = "LDR_Topic",
direction = "wide")
But it aborts into a new session.
Any advice? Thank you!
***** UPDATE
I tried the following approaches but neither resulted in the correct table.
install.packages("data.table")
library (data.table)
BriefingGammas7 <- as.data.table(BriefingGammas6)
BriefingGammas7 <- dcast(BriefingGammas7, UPDATEDID BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
This resulted in the correct 3205 rows, but the incorrect values for each LDR_Topic (they should not be identical, and should be decimals. These numbers seem to reflect LDR_7Code which is not in the dataset). See below for an example:
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 2 | 4 | 15 |
2 | 01999120041284 | 2 | 4 | 15 |
3 | 01999300213 | 2 | 4 | 15 |
I then tried this:
install.packages("tidyverse")
library (tidyverse)
BriefingGammas6 <- BriefingGammas5 |>
pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
select(-c(gamma, LDR_7Code))
This resulted in the correct values per LDR_Topic, but the incorrect number of rows. It stayed at 314,090 rows instead of 3205. See example:
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 0.02843241 | NA | NA |
2 | 01999110036250 | NA | 0.010892233 | NA |
3 | 01999110036250 | NA | 0.010892233 | 0.006081761 |
4 | 01999110036250 | 0.02843241 | NA | 0.006081761 |
Basically, it filled out the values for each topic for 3205 rows (duplicating it multiple times) and then started filling out the values for the next topic. But I'd like to get the 3205 rows to look like this:
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 0.02843241 | 0.010892233 | 0.006081761 |
2 | 01999120041284 | 0.1594207 | 0.005315201 | 0.004850703 |
3 | 01999300213 | 0.4374699 | 0.01607505 | 0.003971634 |
The last one I tried was this:
BriefingGammas7<-reshape(data = BriefingGammas6,
idvar = c("UPDATEDID", "BRIEF_ID"),
timevar = "LDR_Topic",
v.names = "LDR_SUM",
direction = "wide")
Which resulted in this:
UPDATEDID | BRIEF_ID | "acquired resources", "distributed"... |
---|---|---|
1 | 01999110036250 | NA |
2 | 01999120041284 | NA |
No other rows came out.
SOLUTION UPDATE*
Step 1. reduce number of variables Step 2. Remove duplicate observations
BriefingGammas7 <- subset(BriefingGammas6, !duplicated(subset(BriefingGammas6, select=c(UPDATEDID, BRIEF_ID, LDR_SUM, LDR_Topic))))
Step 3. use tidy verse way in the comment below.
BriefingGammas8 <- BriefingGammas7 |>
pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)
CodePudding user response:
To make the case clearer, I tried to create a second row with dummy data that follows the pattern of data in the first row:
dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277",
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902,
0.25326902), LDR_Topic = c("supervises collective followers very closely",
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA,
-2L))
dat
UPDATEDID BRIEF_ID gamma LDR_SUM LDR_Topic LDR_7Code
1 16 04999120040277 879.744 0.153269 supervises collective followers very closely 1
2 17 14999120040277 779.744 0.253269 does something else 2
A base R way
dat |>
reshape(direction = "wide",
idvar = "UPDATEDID",
timevar ="LDR_Topic",
v.names = "LDR_SUM")|>
subset(select = -c(gamma, LDR_7Code))
# The result
# UPDATEDID BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1 16 04999120040277 0.153269 NA
#2 17 14999120040277 NA 0.253269
A tidyverse way
library(tidyverse)
dat |>
pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
select(-c(gamma, LDR_7Code))
#The result
# A tibble: 2 × 4
# UPDATEDID BRIEF_ID `supervises collective followers very closely` `does something else`
# <dbl> <chr> <dbl> <dbl>
#1 16 04999120040277 0.153 NA
#2 17 14999120040277 NA 0.253
A data.table way (recommended for memory efficiency)
library(data.table)
dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
# The result
# UPDATEDID BRIEF_ID does something else supervises collective followers very closely
#1: 16 04999120040277 NA 0.153269
#2: 17 14999120040277 0.253269 NA
Updates
Based on your explanation, the tidyverse
way basically works on the right direction. The only problem is the duplicated rows that have NA
s in some of its columns and you want them to collapse into a single row. This is easy to do with fill()
and distinct()
functions. The only problem in your example is that the UPDATEDID
changed from 1,2,3,4
to 1
with no explanation. Hence, for now, I assume that we can ignore the UPDATEDID
(you can create a new column for it later) and we need only to consider BRIEF_ID
.
yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250,
1999110036250, 1999110036250), acquired_resources = c(0.02843241,
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233,
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))
yourdf # I change the space to '_' to make it easier to control
UPDATEDID BRIEF_ID acquired_resources distributed_resources enhanced
1 1 1.99911e 12 0.02843241 NA NA
2 2 1.99911e 12 NA 0.01089223 NA
3 3 1.99911e 12 NA 0.01089223 0.006081761
4 4 1.99911e 12 0.02843241 NA 0.006081761
yourdf[,-1] |>
fill(acquired_resources,distributed_resources,enhanced,
.direction = 'downup') |>
distinct()
# The result
BRIEF_ID acquired_resources distributed_resources enhanced
1 1.99911e 12 0.02843241 0.01089223 0.006081761
Then, the complete step would be:
dat |>
pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
select(-c(gamma, LDR_7Code)) |>
fill(acquired_resources,distributed_resources,enhanced,
.direction = 'downup') |>
distinct()