Transform dataset and summarize entries of column into a row of data-CodePudding

I have a dataset that contains data in the following manner:

Name, Result, Date #A lot of other stuff as we, but these are the only relevant ones I think
Peter Parker, 150, 2018-03-03
Peter Parker, 155, 2018-03-04
Peter Parker, 156, 2018-03-05
Peter Parker, 154, 2018-03-06
Peter Parker, 158, 2018-03-07
Benny Thompson, 130, 2018-03-03
Benny Thompson, 132, 2018-03-04
Benny Thompson, 138, 2018-03-05
Benny Thompson, 140, 2018-03-07
Benny Thompson, 139, 2018-03-09
Mylo Thony, 177, 2018-03-11

some of the persons are present at least, lets say 5 times. I want to create a model to predict the 5th result with regression if I give it the first four. I therefore assume that all Persons are "behaving" equally and want to transform my data into this format:

150, 155, 156, 154, 158
130, 132, 138, 140, 139

and that only for all persons who at least have 5 results in the record. I have no idea where to start, I come from the Java and C department, usually I would just run a for loop over this, but this seems very un-R-like to me, at least I have never seen anything like this yet. What is the best approach to something like this?

CodePudding user response：

As far as I understood, the following steps have to be done:

remove rows from participants with less than 5 observations
transfer data to wide format

Removing rows

By using the with function from base R you can remove the participants with less than 5 observations.

x <- data.frame(Name=c("Peter Parker", "Peter Parker", "Peter Parker", "Peter Parker","Peter Parker", "Benny Thompson", "Benny Thompson", "Benny Thompson", "Benny Thompson", "Benny Thompson", "Mylo Thony"),
           Result=c(150,155, 156, 154, 158,130, 132, 138, 140, 139, 177))
x <- x[with(x, Name %in% names(which(table(Name)>=5))), ]

Transfer to wide format

Your data is currently in long format. It can be easily transfered to wide format using the spread function from the tidyr package. The spread function requires a key column which contains the new column names. If there are only 5 observations per person, you can just add a new column repeating the numbers 1 to 5 as often as there are unique values in your Name column (calculated with n_distinct from the dplyr package). Note that your data should be ordered by both Name and Date, so that the numbers are assigned to the correct observation.

 x$Measurement <- rep(c(1,2,3,4,5), dplyr::n_distinct(x$Name))
 x_wide <- tidyr::spread(x, Measurement, Result)

If you have more than 5 observations per person, the solution has to be modified.

CodePudding user response：

Perhaps "row" format is not the best way to perform predictions on your data. I'll give an example with dplyr/tidyr for the data in one row case:

The data:

data <- data.frame(Name = c("Peter Parker", "Peter Parker", "Peter Parker", 
"Peter Parker", "Peter Parker", "Benny Thompson", "Benny Thompson", 
"Benny Thompson", "Benny Thompson", "Benny Thompson", "Mylo Thony"
), Result = c(150L, 155L, 156L, 154L, 158L, 130L, 132L, 138L, 
140L, 139L, 177L), Date = c(" 2018-03-03", " 2018-03-04", " 2018-03-05", 
" 2018-03-06", " 2018-03-07", " 2018-03-03", " 2018-03-04", " 2018-03-05", 
" 2018-03-07", " 2018-03-09", " 2018-03-11"))

data
#>              Name Result        Date
#> 1    Peter Parker    150  2018-03-03
#> 2    Peter Parker    155  2018-03-04
#> 3    Peter Parker    156  2018-03-05
#> 4    Peter Parker    154  2018-03-06
#> 5    Peter Parker    158  2018-03-07
#> 6  Benny Thompson    130  2018-03-03
#> 7  Benny Thompson    132  2018-03-04
#> 8  Benny Thompson    138  2018-03-05
#> 9  Benny Thompson    140  2018-03-07
#> 10 Benny Thompson    139  2018-03-09
#> 11     Mylo Thony    177  2018-03-11

To filter and transforming the data

library(dplyr)
library(tidyr)

data_in_one_row <- data |> group_by(Name) |> 
  mutate(count = n(), id = 1:n()) |>
  filter(count == 5)  |>   
  pivot_wider(id_cols = c(Name), names_from = id, 
              values_from = -c(Name, id))

data_in_one_row

#> # A tibble: 2 × 16
#> # Groups:   Name [2]
#>   Name       Resul…¹ Resul…² Resul…³ Resul…⁴ Resul…⁵ Date_1 Date_2 Date_3 Date_4
#>   <chr>        <int>   <int>   <int>   <int>   <int> <chr>  <chr>  <chr>  <chr> 
#> 1 Peter Par…     150     155     156     154     158 " 201… " 201… " 201… " 201…
#> 2 Benny Tho…     130     132     138     140     139 " 201… " 201… " 201… " 201…
#> # … with 6 more variables: Date_5 <chr>, count_1 <int>, count_2 <int>,
#> #   count_3 <int>, count_4 <int>, count_5 <int>, and abbreviated variable names
#> #   ¹Result_1, ²Result_2, ³Result_3, ⁴Result_4, ⁵Result_5

Eventually you can apply a single linear mode prediction, depending on your expectations (on the un transformed dataset).

data <- data |> group_by(Name) |> 
  mutate(count = n(), id = 1:n()) |>
  filter(count == 5) 

model <- lm(Result ~ Name   as.POSIXct(Date), data)

pred_dataset <- data |> group_by(Name) |> 
  summarise(Date = max(as.POSIXct(Date))   24 * 3600)

pred_dataset$prediction = predict(model, pred_dataset)

pred_dataset
#> # A tibble: 3 × 3
#>   Name           Date                prediction
#>   <chr>          <dttm>                   <dbl>
#> 1 Benny Thompson 2018-03-10 00:00:00       143.
#> 2 Mylo Thony     2018-03-12 00:00:00       179.
#> 3 Peter Parker   2018-03-08 00:00:00       159.