Home > Software design >  How can I reshape a long dataset into a short data set with multiple variables
How can I reshape a long dataset into a short data set with multiple variables

Time:05-15

**UPDATE

My data set contains 314090 observations in the following format:

UPDATEDID BRIEF_ID gamma LDR_SUM LDR_Topic LDR_7Code
16 04999120040277 2.879744e-03 0.15326902 supervises collective followers very closely 1

Note: there are duplicate BRIEF_ID numbers (3205 unique #s), LDR_Topics (15 unique LDR topics with corresponding LDR_7Codes) so that is why the data is very long.

I'd like to reshape this data whereby each row is a unique #s (3205 rows) and each LDR_Topic (15) is it's own unique column (20 columns total) with its corresponding LDR_SUM as values within the column. For example:

UPDATEDID BRIEF_ID supervises collective followers very closely
16 04999120040277 0.15326902

So far I've tried:

BriefingGammas4<-reshape(data = BriefingGammas3, 
                         idvar = c("UPDATEDID", "BRIEF_ID"),
                         timevar = "LDR_Topic", 
                         direction = "wide")

But it aborts into a new session.

Any advice? Thank you!

***** UPDATE

I tried the following approaches but neither resulted in the correct table.

install.packages("data.table")
library (data.table)
BriefingGammas7 <- as.data.table(BriefingGammas6)
BriefingGammas7 <- dcast(BriefingGammas7, UPDATEDID   BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

This resulted in the correct 3205 rows, but the incorrect values for each LDR_Topic (they should not be identical, and should be decimals. These numbers seem to reflect LDR_7Code which is not in the dataset). See below for an example:

UPDATEDID BRIEF_ID acquired resources distributed resources enhanced
1 01999110036250 2 4 15
2 01999120041284 2 4 15
3 01999300213 2 4 15

I then tried this:

install.packages("tidyverse")
library (tidyverse)
BriefingGammas6 <- BriefingGammas5 |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code))

This resulted in the correct values per LDR_Topic, but the incorrect number of rows. It stayed at 314,090 rows instead of 3205. See example:

UPDATEDID BRIEF_ID acquired resources distributed resources enhanced
1 01999110036250 0.02843241 NA NA
2 01999110036250 NA 0.010892233 NA
3 01999110036250 NA 0.010892233 0.006081761
4 01999110036250 0.02843241 NA 0.006081761

Basically, it filled out the values for each topic for 3205 rows (duplicating it multiple times) and then started filling out the values for the next topic. But I'd like to get the 3205 rows to look like this:

UPDATEDID BRIEF_ID acquired resources distributed resources enhanced
1 01999110036250 0.02843241 0.010892233 0.006081761
2 01999120041284 0.1594207 0.005315201 0.004850703
3 01999300213 0.4374699 0.01607505 0.003971634

The last one I tried was this:

BriefingGammas7<-reshape(data = BriefingGammas6, 
                         idvar = c("UPDATEDID", "BRIEF_ID"),
                         timevar = "LDR_Topic",
                         v.names = "LDR_SUM",
                         direction = "wide")

Which resulted in this:

UPDATEDID BRIEF_ID "acquired resources", "distributed"...
1 01999110036250 NA
2 01999120041284 NA

No other rows came out.

SOLUTION UPDATE*

Step 1. reduce number of variables Step 2. Remove duplicate observations

BriefingGammas7 <- subset(BriefingGammas6, !duplicated(subset(BriefingGammas6, select=c(UPDATEDID, BRIEF_ID, LDR_SUM, LDR_Topic))))

Step 3. use tidy verse way in the comment below.

BriefingGammas8 <- BriefingGammas7 |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)

CodePudding user response:

To make the case clearer, I tried to create a second row with dummy data that follows the pattern of data in the first row:

dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277", 
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902, 
0.25326902), LDR_Topic = c("supervises collective followers very closely", 
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA, 
-2L))

dat
  UPDATEDID       BRIEF_ID   gamma  LDR_SUM                                    LDR_Topic LDR_7Code
1        16 04999120040277 879.744 0.153269 supervises collective followers very closely         1
2        17 14999120040277 779.744 0.253269                          does something else         2

A base R way

dat |> 
  reshape(direction = "wide", 
          idvar  = "UPDATEDID",
          timevar ="LDR_Topic",
          v.names = "LDR_SUM")|>
  subset(select = -c(gamma, LDR_7Code))

# The result

#  UPDATEDID       BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1        16 04999120040277                                             0.153269                          NA
#2        17 14999120040277                                                   NA                    0.253269

A tidyverse way

library(tidyverse)

dat |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code))

#The result

# A tibble: 2 × 4
#  UPDATEDID BRIEF_ID       `supervises collective followers very closely` `does something else`
#      <dbl> <chr>                                                   <dbl>                 <dbl>
#1        16 04999120040277                                          0.153                NA    
#2        17 14999120040277                                         NA                     0.253

A data.table way (recommended for memory efficiency)

library(data.table)

dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID   BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

# The result

#   UPDATEDID       BRIEF_ID does something else supervises collective followers very closely
#1:        16 04999120040277                  NA                                     0.153269
#2:        17 14999120040277            0.253269                                           NA

Updates

Based on your explanation, the tidyverse way basically works on the right direction. The only problem is the duplicated rows that have NAs in some of its columns and you want them to collapse into a single row. This is easy to do with fill() and distinct() functions. The only problem in your example is that the UPDATEDID changed from 1,2,3,4 to 1 with no explanation. Hence, for now, I assume that we can ignore the UPDATEDID (you can create a new column for it later) and we need only to consider BRIEF_ID.

yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250, 
1999110036250, 1999110036250), acquired_resources = c(0.02843241, 
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233, 
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))

yourdf   # I change the space to '_' to make it easier to control

  UPDATEDID    BRIEF_ID acquired_resources distributed_resources    enhanced
1         1 1.99911e 12         0.02843241                    NA          NA
2         2 1.99911e 12                 NA            0.01089223          NA
3         3 1.99911e 12                 NA            0.01089223 0.006081761
4         4 1.99911e 12         0.02843241                    NA 0.006081761

yourdf[,-1] |>
     fill(acquired_resources,distributed_resources,enhanced, 
     .direction = 'downup') |> 
     distinct()
    

# The result
     BRIEF_ID acquired_resources distributed_resources    enhanced
1 1.99911e 12         0.02843241            0.01089223 0.006081761 

Then, the complete step would be:

dat |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code)) |>
 fill(acquired_resources,distributed_resources,enhanced, 
     .direction = 'downup') |> 
     distinct()
  • Related