Home > Back-end >  Getting a non-profit's data set Tidy in R; a complex pivot
Getting a non-profit's data set Tidy in R; a complex pivot

Time:09-26

first time question asker so I will try to explain my best. I am trying to tidy up a data set for a non-profit that counts benthic macroinvertebrates (stream bugs). The goal of this data set is to compare ID accuracy between volunteers and biologists. The original data was stored in an extremely complex Excel sheet. I finally have the data concatenated in a large .csv, but I still have an issue where 1 observation (1 observer at 1 moment in time) is in multiple rows. I would like to use pivot functions to get this taken care of, but it's not as cut and dry as some of the how-to's I've been reading. I created a simplified example of my issue here:

   Workbook Location   Taxa    K_Vol K_Bio L_Vol L_Bio V_Vol V_Bio
   <chr>    <chr>      <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 A        BigCreek   Worm        4     4     2     5     5     3
 2 A        BigCreek   Larvae      4     4     3     2     2     3
 3 A        BigCreek   Swimmer     1     1     2     3     3     3
 4 A        SmallCreek Worm        1     2     2     4     1     5
 5 A        SmallCreek Larvae      1     4     4     2     5     5
 6 A        SmallCreek Swimmer     2     2     3     4     3     1
 7 B        BigCreek   Worm        5     1     3     3     4     1
 8 B        BigCreek   Larvae      5     5     2     5     4     4
 9 B        BigCreek   Swimmer     3     1     4     5     4     4
10 B        SmallCreek Worm        4     1     3     3     4     4
11 B        SmallCreek Larvae      1     4     5     5     2     5
12 B        SmallCreek Swimmer     5     3     3     5     1     1

I would like it to look like this:

 A tibble: 24 × 6
   Workbook Location   Counter  Worm Larvae Swimmer
   <chr>    <chr>      <chr>   <dbl>  <dbl>   <dbl>
 1 A        BigCreek   K_Vol       4      4       1
 2 A        BigCreek   K_Bio       4      4       1
 3 A        BigCreek   L_Vol       2      3       2
 4 A        BigCreek   L_Bio       5      2       3
 5 A        BigCreek   V_Vol       5      2       3
 6 A        BigCreek   V_Bio       3      3       3
 7 A        SmallCreek K_Vol       1      1       2
 8 A        SmallCreek K_Bio       2      4       2
 9 A        SmallCreek L_Vol       2      4       3
10 A        SmallCreek L_Bio       4      2       4
# … with 14 more rows

I have used this to get one column to behave more or less, but I am unsure how to proceed from here as this is a little more of a complicated transformation.

simplecurrent %>%
    pivot_longer(c('K_Vol', 'K_Bio', 'L_Vol', 'L_Bio', 'V_Vol', 'V_Bio'), 
               names_to = "Counter", values_to = 'Worm')  %>%

PS. The pipe at the end of that last bit is from me having the tibble print entirely, not needed here It produces this:

   Workbook Location   Taxa    Counter  Worm
   <chr>    <chr>      <chr>   <chr>   <dbl>
 1 A        BigCreek   Worm    K_Vol       4
 2 A        BigCreek   Worm    K_Bio       4
 3 A        BigCreek   Worm    L_Vol       2
 4 A        BigCreek   Worm    L_Bio       5
 5 A        BigCreek   Worm    V_Vol       5
 6 A        BigCreek   Worm    V_Bio       3
 7 A        BigCreek   Larvae  K_Vol       4
 8 A        BigCreek   Larvae  K_Bio       4
 9 A        BigCreek   Larvae  L_Vol       3
10 A        BigCreek   Larvae  L_Bio       2
11 A        BigCreek   Larvae  V_Vol       2

and so on, all looking good. How do I get the other species to behave like "worm"?

CodePudding user response:

A data.table option using melt and dcast.

library(data.table)

dcast(melt(setDT(simplecurrent), id.vars = c('Workbook', 'Location', 'Taxa'), 
      variable.name = 'Counter'), 
      Workbook   Location   Counter ~ Taxa, value.var = 'value')

#    Workbook   Location Counter Larvae Swimmer Worm
# 1:        A   BigCreek   K_Vol      4       1    4
# 2:        A   BigCreek   K_Bio      4       1    4
# 3:        A   BigCreek   L_Vol      3       2    2
# 4:        A   BigCreek   L_Bio      2       3    5
# 5:        A   BigCreek   V_Vol      2       3    5
# 6:        A   BigCreek   V_Bio      3       3    3
# 7:        A SmallCreek   K_Vol      1       2    1
# 8:        A SmallCreek   K_Bio      4       2    2
# 9:        A SmallCreek   L_Vol      4       3    2
#10:        A SmallCreek   L_Bio      2       4    4
#11:        A SmallCreek   V_Vol      5       3    1
#12:        A SmallCreek   V_Bio      5       1    5
#13:        B   BigCreek   K_Vol      5       3    5
#14:        B   BigCreek   K_Bio      5       1    1
#15:        B   BigCreek   L_Vol      2       4    3
#16:        B   BigCreek   L_Bio      5       5    3
#17:        B   BigCreek   V_Vol      4       4    4
#18:        B   BigCreek   V_Bio      4       4    1
#19:        B SmallCreek   K_Vol      1       5    4
#20:        B SmallCreek   K_Bio      4       3    1
#21:        B SmallCreek   L_Vol      5       3    3
#22:        B SmallCreek   L_Bio      5       5    3
#23:        B SmallCreek   V_Vol      2       1    4
#24:        B SmallCreek   V_Bio      5       1    4
#    Workbook   Location Counter Larvae Swimmer Worm

CodePudding user response:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = c(K_Vol:V_Bio),names_to = "Counter") %>% 
  pivot_wider(names_from = Taxa,values_from = value)

# A tibble: 24 x 6
   Workbook Location   Counter  Worm Larvae Swimmer
   <chr>    <chr>      <chr>   <int>  <int>   <int>
 1 A        BigCreek   K_Vol       4      4       1
 2 A        BigCreek   K_Bio       4      4       1
 3 A        BigCreek   L_Vol       2      3       2
 4 A        BigCreek   L_Bio       5      2       3
 5 A        BigCreek   V_Vol       5      2       3
 6 A        BigCreek   V_Bio       3      3       3
 7 A        SmallCreek K_Vol       1      1       2
 8 A        SmallCreek K_Bio       2      4       2
 9 A        SmallCreek L_Vol       2      4       3
10 A        SmallCreek L_Bio       4      2       4
# ... with 14 more rows

-- data

structure(list(Workbook = c("A", "A", "A", "A", "A", "A", "B", 
"B", "B", "B", "B", "B"), Location = c("BigCreek", "BigCreek", 
"BigCreek", "SmallCreek", "SmallCreek", "SmallCreek", "BigCreek", 
"BigCreek", "BigCreek", "SmallCreek", "SmallCreek", "SmallCreek"
), Taxa = c("Worm", "Larvae", "Swimmer", "Worm", "Larvae", "Swimmer", 
"Worm", "Larvae", "Swimmer", "Worm", "Larvae", "Swimmer"), K_Vol = c(4L, 
4L, 1L, 1L, 1L, 2L, 5L, 5L, 3L, 4L, 1L, 5L), K_Bio = c(4L, 4L, 
1L, 2L, 4L, 2L, 1L, 5L, 1L, 1L, 4L, 3L), L_Vol = c(2L, 3L, 2L, 
2L, 4L, 3L, 3L, 2L, 4L, 3L, 5L, 3L), L_Bio = c(5L, 2L, 3L, 4L, 
2L, 4L, 3L, 5L, 5L, 3L, 5L, 5L), V_Vol = c(5L, 2L, 3L, 1L, 5L, 
3L, 4L, 4L, 4L, 4L, 2L, 1L), V_Bio = c(3L, 3L, 3L, 5L, 5L, 1L, 
1L, 4L, 4L, 4L, 5L, 1L)), class = "data.frame", row.names = c(NA, 
-12L))
  • Related