Home > Software engineering >  Convert table 4x16 in a 4x4 by agrouping similar column values
Convert table 4x16 in a 4x4 by agrouping similar column values

Time:10-13

Good night everyone: First of all, thank you for your time and for your effort answering this question. I have the following data:

 aaa  aac  aag  aat  aca  acc  acg  act  aga  agc  agg  agt  ata  atc  atg  att  caa  cac  cag  cat  cca  ccc  ccg  cct 
1501  895  660  989  759  858  658  524  667  878  608  621  897  983  901 1030  669  713 1381  708  610  514 1308  542 
 cga  cgc  cgg  cgt  cta  ctc  ctg  ctt  gaa  gac  gag  gat  gca  gcc  gcg  gct  gga  ggc  ggg  ggt  gta  gtc  gtg  gtt 
 516  888 1398  664  196  410 1371  523 1071  624  519 1118 1045  881 1007  835  894 1037  734  915  618  660  862  863 
 taa  tac  tag  tat  tca  tcc  tcg  tct  tga  tgc  tgg  tgt  tta  ttc  ttg  ttt 
 805  567  214  996 1057  721  493  599 1255  965  840  803  871  816  729 1364 

From here I have created the following table in R:

    aa  ac   ag   at   ca  cc   cg  ct   ga   gc   gg  gt  ta  tc   tg   tt
a 1501 895  660  989  759 858  658 524  667  878  608 621 897 983  901 1030
c  669 713 1381  708  610 514 1308 542  516  888 1398 664 196 410 1371  523
g 1071 624  519 1118 1045 881 1007 835  894 1037  734 915 618 660  862  863
t  805 567  214  996 1057 721  493 599 1255  965  840 803 871 816  729 1364 

The code used is this:

triple <- count(morseq,wordsize=3)
triple
R <- matrix(triple,4,16,byrow=T)
colnames(R) <- c("aa","ac","ag","at","ca","cc","cg","ct","ga","gc","gg","gt","ta","tc","tg","tt") ## asigna nombre a las columnas
row.names(R) <- c("a","c","g","t")

I want to simplify the table above by converting it to a 4x4 table where the columns represent the third letter and the rows represents the first letter. Therefore, I have to sum the values where the first and final letter are the same from the 4x16 table and I don't know how to start doing it.

The final result should be something like this:

    a   c    g    t   
a   x   x    x    x  
c   x   x    x    x  
g   x   x    x    x  
t   x   x    x    x 

Sinceresly,

CodePudding user response:

One option would be to first reshape to long using tidyr::pivot_longer and to afterwards use summarise to compute the sum:

library(tidyr)
library(dplyr)
library(tibble)

pivot_longer(R, everything(), names_to = c(".value", "name"), names_pattern = "(.)(.)") |>
  group_by(name) |>
  summarise(across(everything(), sum)) |>
  column_to_rownames("name")
#>      a    c    g    t
#> a 4046 3471 3332 2582
#> c 2799 2974 3768 2869
#> g 2774 3466 3580 3863
#> t 3811 2500 3003 3780

DATA

R <- structure(list(aa = c(1501L, 669L, 1071L, 805L), ac = c(
  895L,
  713L, 624L, 567L
), ag = c(660L, 1381L, 519L, 214L), at = c(
  989L,
  708L, 1118L, 996L
), ca = c(759L, 610L, 1045L, 1057L), cc = c(
  858L,
  514L, 881L, 721L
), cg = c(658L, 1308L, 1007L, 493L), ct = c(
  524L,
  542L, 835L, 599L
), ga = c(667L, 516L, 894L, 1255L), gc = c(
  878L,
  888L, 1037L, 965L
), gg = c(608L, 1398L, 734L, 840L), gt = c(
  621L,
  664L, 915L, 803L
), ta = c(897L, 196L, 618L, 871L), tc = c(
  983L,
  410L, 660L, 816L
), tg = c(901L, 1371L, 862L, 729L), tt = c(
  1030L,
  523L, 863L, 1364L
)), class = "data.frame", row.names = c(
  "a",
  "c", "g", "t"
))

CodePudding user response:

Assuming that you have a table rather than a data frame, you can do:

library(tidyverse) 

tab %>%
  as.data.frame() %>%
  mutate(Var2 = substr(Var2, 2, 2)) %>%
  group_by(Var1, Var2) %>%
  summarise(Freq = sum(Freq), .groups = "drop") %>%
  pivot_wider(names_from = Var2, values_from = Freq) %>%
  column_to_rownames("Var1")
#>      a    c    g    t
#> a 3824 3614 2827 3164
#> c 1991 2525 5458 2437
#> g 3628 3202 3122 3731
#> t 3988 3069 2276 3762

Data

tab <- structure(c(1501L, 669L, 1071L, 805L, 895L, 713L, 624L, 567L, 
660L, 1381L, 519L, 214L, 989L, 708L, 1118L, 996L, 759L, 610L, 
1045L, 1057L, 858L, 514L, 881L, 721L, 658L, 1308L, 1007L, 493L, 
524L, 542L, 835L, 599L, 667L, 516L, 894L, 1255L, 878L, 888L, 
1037L, 965L, 608L, 1398L, 734L, 840L, 621L, 664L, 915L, 803L, 
897L, 196L, 618L, 871L, 983L, 410L, 660L, 816L, 901L, 1371L, 
862L, 729L, 1030L, 523L, 863L, 1364L), dim = c(4L, 16L), dimnames = list(
c("a", "c", "g", "t"), c("aa", "ac", "ag", "at", "ca", "cc", 
"cg", "ct", "ga", "gc", "gg", "gt", "ta", "tc", "tg", "tt"
)), class = "table")

tab
#>     aa   ac   ag   at   ca   cc   cg   ct   ga   gc   gg   gt   ta   tc   tg   tt
#> a 1501  895  660  989  759  858  658  524  667  878  608  621  897  983  901 1030
#> c  669  713 1381  708  610  514 1308  542  516  888 1398  664  196  410 1371  523
#> g 1071  624  519 1118 1045  881 1007  835  894 1037  734  915  618  660  862  863
#> t  805  567  214  996 1057  721  493  599 1255  965  840  803  871  816  729 1364

Created on 2022-10-12 with reprex v2.0.2

  •  Tags:  
  • r
  • Related