Home > front end >  Align dataframe columns according to row values
Align dataframe columns according to row values

Time:02-08

I have a textfile which i read in as a dataframe where the first row and then every second row, i.e. row 1,3,5 and so on contains a quality score.

Then every second row starting from row 2 contains a count of that given quality score, i.e. row 2,4,6

As an example

      V3       V4       V5       V6       V7       V8       V9      V10       V11       V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
1      2       18       25       27       28       30       31       32        33        34  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
2     501   14528     21512    27892   30462   34933.     39589    76712    205029  499660 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
.
.
.
199     2      13      14       15       16       21       22       23       24       25       26       27       28       29       30       31       32       33       34       35        36        37        38
200  967 5361 9064 24743 24853 25897 26115 30293 31068 33030 35463 37838 40588 45483 45877 51572 52497 59295 72964 85302 102564 125874 499660

In line 1 the quality values range between 2 - 34. Line 199 are another line with qualities and it ranges from 2 - 38 with values between 2 and 18, namely 13,14,15,16 which is not present in line 1. Similar line 199 doesn't contain quality value 18 present in line 1. And as can be seen, column "V4" contains the quality 18 in line 1 and 13 in line 199.

So my issue is I would like to align the counts (so every second line, starting from line 2) according to the qualities (in the line above), such that all columns contain the same quality score.

I would like something along these lines (just showing a few columns)

1      2       13      14       15       16       18 . . .
2     501      NA      NA       NA       NA     14528
.
.
.
199     2       13      14       15       16       18 . . .
200   967      5361    9064    24743    24853      NA

I have read about the apply functions, but I have no idea how to attack this specific problem, and I cannot really find any similar questions

> dput(head(Occurence, 20))
structure(list(V3 = c(2L, 501540L, 18L, 14942133L, 18L, 15333830L, 
15L, 14534463L, 2L, 67356L, 2L, 21283L, 2L, 693L, 2L, 11591L, 
2L, 10735L, 2L, 1357L), V4 = c(18L, 14528493L, 25L, 20570845L, 
25L, 20770770L, 25L, 19772180L, 15L, 14549162L, 14L, 3733153L, 
14L, 3314490L, 14L, 2954483L, 14L, 2582053L, 14L, 2422585L), 
    V5 = c(25L, 21512178L, 27L, 26698655L, 27L, 26229231L, 27L, 
    24871325L, 25L, 19678213L, 15L, 5418224L, 15L, 4892911L, 
    15L, 4475734L, 15L, 3959680L, 15L, 3783535L), V6 = c(27L, 
    27892666L, 28L, 30553444L, 28L, 30238008L, 28L, 28881657L, 
    27L, 24509361L, 16L, 15734383L, 16L, 15922208L, 16L, 15292678L, 
    16L, 14754664L, 16L, 14825386L), V7 = c(28L, 30462569L, 30L, 
    35551769L, 30L, 35425235L, 30L, 34011574L, 28L, 28470507L, 
    24L, 15735995L, 24L, 15936954L, 24L, 15328991L, 24L, 14803665L, 
    24L, 14894503L), V8 = c(30L, 34933739L, 31L, 43332862L, 31L, 
    44100550L, 31L, 43008899L, 30L, 33618696L, 26L, 21724499L, 
    25L, 15937493L, 25L, 15330548L, 25L, 14813003L, 25L, 14912047L
    ), V9 = c(31L, 39589167L, 32L, 74856672L, 32L, 74198843L, 
    32L, 73167369L, 31L, 42602404L, 27L, 22451599L, 26L, 20931741L, 
    26L, 19685005L, 26L, 18821729L, 26L, 18689699L), V10 = c(32L, 
    76712990L, 33L, 188084794L, 33L, 182503127L, 33L, 179841094L, 
    32L, 72983555L, 28L, 27361625L, 27L, 21586370L, 27L, 20346737L, 
    27L, 19445796L, 27L, 19328901L), V11 = c(33L, 205029125L, 
    34L, 499660772L, 34L, 499660772L, 34L, 499660772L, 33L, 181252829L, 
    29L, 32594176L, 28L, 24463465L, 28L, 22726939L, 28L, 21562094L, 
    28L, 21239287L), V12 = c(34L, 499660772L, NA, NA, NA, NA, 
    NA, NA, 34L, 499660772L, 30L, 37445130L, 29L, 28923072L, 
    29L, 26758396L, 29L, 25534527L, 29L, 25347792L), V13 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, 31L, 43590775L, 30L, 
    30982652L, 30L, 28653896L, 30L, 27153473L, 30L, 26651137L
    ), V14 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32L, 44152908L, 
    31L, 35625787L, 31L, 33029196L, 31L, 31207384L, 31L, 30528291L
    ), V15 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 33L, 50873330L, 
    32L, 36061896L, 32L, 33422769L, 32L, 31553033L, 32L, 30862392L
    ), V16 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 34L, 63906416L, 
    33L, 42006699L, 33L, 38759787L, 33L, 36348747L, 33L, 35496033L
    ), V17 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 35L, 72684296L, 
    34L, 51548498L, 34L, 46689337L, 34L, 43680971L, 34L, 42730430L
    ), V18 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 36L, 105169117L, 
    35L, 59677004L, 35L, 53868900L, 35L, 50076642L, 35L, 48912839L
    ), V19 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 37L, 171796607L, 
    36L, 75648002L, 36L, 66246037L, 36L, 61636771L, 36L, 60176054L
    ), V20 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 38L, 499660772L, 
    37L, 113735617L, 37L, 90246995L, 37L, 82047368L, 37L, 79133165L
    ), V21 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    38L, 499660772L, 38L, 499660772L, 38L, 499660772L, 38L, 499660772L
    ), V22 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_), V23 = c(NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), V24 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), V25 = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, 
20L), class = "data.frame")

CodePudding user response:

This answer makes heavy use of the tidyverse framework. The operations performed in order to transform the example are the following:

  1. assign a pairID to pairs of consecutive rows using a simple function of row_number
  2. split the dataframe according to pairID into a named list of dataframes with names taken from pairID
  3. transform each dataframe and collect them into one big dataframe
  4. sort and reshape the dataframe pivoting on Quality
library(dplyr)
library(tidyr)
library(purrr)

df |>
  mutate(pairID = 3   2*as.integer((row_number() - 1) /2)) %>%
  {split(select(., -pairID), pull(., pairID))} %>%
  purrr::map_dfr(~{data.frame(Quality = unlist(.x[1,]),
                              Occurrence = unlist(.x[2,]))},
                 .id = "pairID") %>%
  na.omit() %>%
  arrange(as.integer(pairID)) %>%
  pivot_wider(values_from = Occurrence,
              names_from = pairID,
              names_prefix = "Row") %>%
  arrange(Quality) |>
  as.data.frame()

##>   Quality      Row3      Row5      Row7      Row9     Row11     Row13
##>1        2    501540        NA        NA        NA     67356     21283
##>2       14        NA        NA        NA        NA        NA   3733153
##>3       15        NA        NA        NA  14534463  14549162   5418224
##>4       16        NA        NA        NA        NA        NA  15734383
##>5       18  14528493  14942133  15333830        NA        NA        NA
##>6       24        NA        NA        NA        NA        NA  15735995
##>7       25  21512178  20570845  20770770  19772180  19678213        NA
##>8       26        NA        NA        NA        NA        NA  21724499
##>9       27  27892666  26698655  26229231  24871325  24509361  22451599
##>10      28  30462569  30553444  30238008  28881657  28470507  27361625
##>11      29        NA        NA        NA        NA        NA  32594176
##>12      30  34933739  35551769  35425235  34011574  33618696  37445130
##>13      31  39589167  43332862  44100550  43008899  42602404  43590775
##>14      32  76712990  74856672  74198843  73167369  72983555  44152908
##>15      33 205029125 188084794 182503127 179841094 181252829  50873330
##>16      34 499660772 499660772 499660772 499660772 499660772  63906416
##>17      35        NA        NA        NA        NA        NA  72684296
##>18      36        NA        NA        NA        NA        NA 105169117
##>19      37        NA        NA        NA        NA        NA 171796607
##>20      38        NA        NA        NA        NA        NA 499660772
##>       Row15     Row17     Row19     Row21
##>1        693     11591     10735      1357
##>2    3314490   2954483   2582053   2422585
##>3    4892911   4475734   3959680   3783535
##>4   15922208  15292678  14754664  14825386
##>5         NA        NA        NA        NA
##>6   15936954  15328991  14803665  14894503
##>7   15937493  15330548  14813003  14912047
##>8   20931741  19685005  18821729  18689699
##>9   21586370  20346737  19445796  19328901
##>10  24463465  22726939  21562094  21239287
##>11  28923072  26758396  25534527  25347792
##>12  30982652  28653896  27153473  26651137
##>13  35625787  33029196  31207384  30528291
##>14  36061896  33422769  31553033  30862392
##>15  42006699  38759787  36348747  35496033
##>16  51548498  46689337  43680971  42730430
##>17  59677004  53868900  50076642  48912839
##>18  75648002  66246037  61636771  60176054
##>19 113735617  90246995  82047368  79133165
##>20 499660772 499660772 499660772 499660772

  •  Tags:  
  • Related