Home > Software engineering >  Mutate data to juxtapose repeat measurments
Mutate data to juxtapose repeat measurments

Time:03-26

Let's pretend I am measuring the distance the distance grasshoppers can jump pre- and post-treatment. This is just for fun, the real measurement could be anything, and the bigger picture is to understand the group_by() command.

For the statistical test I would like to run, each observation needs to have its own column, but I'm given a dataset that is not in this format...!!, and I would like to use the package library(dplyr) , and the command group_by()to shape the data for my needs, because if this were to happen again, I could make a more general code to work over other datasets :)

I am able to do this using commands, such as filter(), and then cbind()at a later step (see example below). But it also requires renaming a column. Additionally, if I wanted to add a column, let's say "difference", to calculate the observed difference between observation 1, and observation 2, I can do this, but then I need to add another line of code (again, see example below)

It would be great to do this with less lines of code

Please see what I have tried, and let me know how I could modify the code group by() to work properly.

example_df <- data.frame( "observation" = character(0), "distance" = integer(0))

Assign names for our "observations", remember, in this example, it's done twice

variable_names <- c( "obs_1", "obs_2")

Assign fictitious values to y

w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)

Combine everything for this pretend exercise

df <- data.frame( "observation" = variable_names, "distance" = c(w,x,y,z))
attach(df)

Here's how I achieved the desired results for this example

library(dplyr)

dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")

names(dat2)
colnames(dat2)[2] <- "distance_2"

final <- cbind(dat,dat2)
attach(final)

final$difference <- distance-distance_2

I tried using the group_by() command, I just get an error message

final <- df %>% group_by(observation,distance) %>% summarise(
  Observation_1 = first(observation), distance_1 = first(distance),
  Observation_2 = last(observation), distance_2 = last(distance,difference=distance-distance_2)))

It would be great to get the above code to work

To make things even more "fun" :), what if more than one variable was measured. Could I make a general code to achieve the desired results, again, without having the go over the whole filter() process, with cbind()etc..

Here's an example (expanded on the above one)



example_df <- data.frame( "observation" = character(0), "distance" = integer(0),"weight" = integer(0),"speed" = integer(0))


variable_names <- c( "obs_1", "obs_2")


w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)


df <- data.frame( "observation" = variable_names, "distance" = c(w,x),"weight" = c(y,z),"speed" = c(a,b))

attach(df)

library(dplyr)

dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")

names(dat2)

colnames(dat2)[2] <- "distance_2"
colnames(dat2)[3] <- "weight_2"
colnames(dat2)[4] <- "speed_2"

final <- cbind(dat,dat2)

attach(final)

final$difference <- distance-distance_2
final$difference_weight <- weight-weight_2
final$difference_speed <- speed-speed_2

Thanks everyone!

CodePudding user response:

Would be simple with pivot_wider, though I presume your data also has an id column to link observations somehow, so have added one here:

library(tidyverse)

w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)

variable_names <- c( "obs_1", "obs_2")
df <-
  data.frame(
    "id" = rep(1:200, each = 2),
    "observation" = variable_names,
    "distance" = c(w, x),
    "weight" = c(y, z),
    "speed" = c(a, b)
  )

df %>%
  pivot_wider(
    id_cols = id,
    names_from = observation,
    values_from = distance:speed
  )

#> # A tibble: 200 x 7
#>       id distance_obs_1 distance_obs_2 weight_obs_1 weight_obs_2 speed_obs_1
#>    <int>          <dbl>          <dbl>        <dbl>        <dbl>       <dbl>
#>  1     1           3.63           2.80         2.98       -0.795        3.58
#>  2     2           4.96           6.84         4.11        9.92         8.21
#>  3     3           4.84           7.51         6.32        3.28         9.02
#>  4     4           3.79           6.82         5.42        6.86         7.96
#>  5     5           5.48           2.84         9.56        3.27         3.55
#>  6     6           8.78           2.06         3.81        4.35         5.93
#>  7     7           8.42           4.21         3.92        4.40         9.37
#>  8     8           8.26           9.67         4.05        6.19         3.17
#>  9     9           3.80           4.47         6.58        5.38         6.09
#> 10    10           4.67           2.86         6.27        6.88         3.72
#> # ... with 190 more rows, and 1 more variable: speed_obs_2 <dbl>

Follow-up

You can also tell pivot_wider to use a function in combining values. Here in this example I've passed names_from = NULL so that every column is paired up by id, and using the diff function to calculate the difference:

df %>%
  pivot_wider(
    id_cols = id,
    names_from = NULL,
    values_from = distance:speed,
    values_fn = diff,
    names_sep = ""
  )

#> # A tibble: 200 x 4
#>       id distance weight  speed
#>    <int>    <dbl>  <dbl>  <dbl>
#>  1     1   -0.828 -3.77   4.45 
#>  2     2    1.88   5.82  -1.07 
#>  3     3    2.66  -3.04  -4.31 
#>  4     4    3.03   1.45  -0.969
#>  5     5   -2.64  -6.29   5.06 
#>  6     6   -6.72   0.541 -2.24 
#>  7     7   -4.20   0.481 -5.82 
#>  8     8    1.41   2.14   3.71 
#>  9     9    0.669 -1.19  -1.14 
#> 10    10   -1.81   0.607 -2.62 
#> # ... with 190 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)

  • Related