Data cleaning & subsetting in nested list

Time:02-18

I couldn't find any previous questions that address these steps for a nested list, and my own attempts haven't gotten me anywhere either!

I have a nested list df.

  1. I would like to rename the first 3 columns in all data frames to c("one","two","three").
  2. In each data frame, I want to keep the first 3 columns and the column with the same name as that data frame's name in the list.
  3. Each data frame now has 4 columns. In each data frame, I want to keep the values in the second column where the value in the fourth column is greater than 3.
  4. Return a nested list with the name of each data frame and the selected values from the second column (from step 3).

purrr and dplyr approaches are preferred, but anything else is much appreciated!

> dput(map_depth(df,1, head))
list(`CD8_C01-LEF1` = structure(list(...1 = c("1236", "6194", 
"51176", "6402", "6137", "1937"), ...2 = c("CCR7", "RPS6", "LEF1", 
"SELL", "RPL13", "EEF1G"), ...3 = c(448.275813024615, 114.565282822255, 
405.993571415472, 352.462886197845, 152.430598462657, 73.5226212775651
), `P-value*` = c(0, 2.35914832807463e-150, 0, 0, 1.03146807397557e-195, 
3.00681346250943e-98), `CD8_C01-LEF1` = c(6.3388353508401, 1.36075129906401, 
5.11667843995657, 5.22902495053118, 1.35703181746742, 1.72815687302818
), `CD8_C02-GPR183` = c(2.71993044636725, 0.755445092850178, 
2.26029822474036, 3.57732840656951, 0.757664532314421, 0.732003573596204
), `CD8_C03-CX3CR1` = c(-2.50016459757821, 0.0430813598361915, 
-1.47763877045973, -1.31104077043168, -0.118054173396857, -0.217984797372657
), `CD8_C04-GZMK` = c(-0.639352384551204, -0.304854019068466, 
-1.400271288872, -1.56965980479594, -0.128422617265835, -0.701864111617954
), `CD8_C05-CD6` = c(-2.35873754058284, -0.115888861319928, -2.08628173736428, 
-3.32630706764402, -0.177640817498698, -0.215754243123614), `CD8_C06-CD160` = c(-2.85558322130952, 
-0.29530343951866, -2.20232116143474, -3.274807762691, -0.440783845861116, 
-0.56207661416919), `CD8_C07-LAYN` = c(-2.75671138163062, -0.887003245107014, 
-2.40845402752497, -3.47698326675668, -1.03656381624963, -1.46468960616135
), `CD8_C08-SLC4A10` = c(-2.68199272253543, 0.0292368512820967, 
-2.1581654239029, -2.99895134853712, 0.0615744908900675, 0.192173783941343
)), row.names = c(NA, 6L), class = "data.frame"), `CD8_C02-GPR183` = structure(list(
    ...1 = c("3575", "4050", "1901", "6653", "1880", "10628"), 
    ...2 = c("IL7R", "LTB", "S1PR1", "SORL1", "GPR183", "TXNIP"
    ), ...3 = c(268.347035159053, 151.397715576146, 423.815475272167, 
    154.131971403975, 161.502687932662, 138.188069200824), `P-value*` = c(0, 
    1.63481853000449e-194, 0, 1.09616441981898e-197, 3.47999420200636e-206, 
    5.87606326954945e-179), `CD8_C01-LEF1` = c(2.25872137515665, 
    1.06433926285014, 2.06890434595653, 1.77222927526522, -2.32256398023726, 
    1.17445992511194), `CD8_C02-GPR183` = c(3.58534594694992, 
    2.33774626980998, 3.1044712936119, 3.00075778716827, 1.54874669286004, 
    2.11053414857411), `CD8_C03-CX3CR1` = c(-2.73122665345433, 
    -3.23251051546321, 2.76359001828421, 0.899851788567591, -3.4595583469893, 
    1.9924219816788), `CD8_C04-GZMK` = c(-1.20359289904198, -2.27859013855459, 
    -0.289843306560729, 0.0930099548084882, 0.293766916539111, 
    -1.05998934689132), `CD8_C05-CD6` = c(0.771026257612103, 
    -1.84446654315228, -1.92859019625536, -0.993527571866541, 
    -0.517242518264243, -1.05505195656161), `CD8_C06-CD160` = c(-1.26433565787961, 
    -3.62072638085859, -1.99838091859197, -2.66224984657089, 
    -3.84677781455005, -0.741084525734145), `CD8_C07-LAYN` = c(-4.85420539962432, 
    -3.79535857695107, -2.07599716553024, -2.41001692585172, 
    -3.66993376805675, -1.90910214659534), `CD8_C08-SLC4A10` = c(1.79563839118781, 
    0.431971358693421, 0.24665792844753, 0.820564247625701, -0.941462395796914, 
    0.224912511574641)), row.names = c(NA, 6L), class = "data.frame"), 
    `CD8_C03-CX3CR1` = structure(list(...1 = c("5341", "1524", 
    "83888", "2214", "343413", "10219"), ...2 = c("PLEK", "CX3CR1", 
    "FGFBP2", "FCGR3A", "FCRL6", "KLRG1"), ...3 = c(372.816216710618, 
    713.554708746553, 575.834099328186, 419.996034284325, 215.715234731706, 
    281.827177706662), `P-value*` = c("0", "0", "0", "0", "3.5450627744914998E-266", 
    "0"), `CD8_C01-LEF1` = c(-1.34745098111019, -0.39476162886016, 
    -0.248194028712413, -0.326944139043036, -0.833877751680806, 
    -0.822668603983214), `CD8_C02-GPR183` = c(0.50737446056126, 
    -0.495638146054913, -0.484905896571723, -0.125753818325312, 
    0.0263098770399738, 0.894340812937189), `CD8_C03-CX3CR1` = c(6.36825282208761, 
    5.38301238794739, 5.26196506464758, 5.6197563760267, 5.8532850807879, 
    5.36851683724817), `CD8_C04-GZMK` = c(1.44463895049283, -0.513803138075432, 
    -0.125340966094923, 0.2447981258131, 1.34537977512099, 2.10784813093189
    ), `CD8_C05-CD6` = c(-0.718776566594413, -0.795121492384525, 
    -0.681892196238474, -0.421395883952147, 0.0987360993173341, 
    -1.35585804120358), `CD8_C06-CD160` = c(-0.550964233191398, 
    -0.794078725052049, -0.707741972359531, -0.156207202527366, 
    2.24842830259497, -1.28977809817504), `CD8_C07-LAYN` = c(0.0641870785667258, 
    -0.785201010640904, -0.631939964779986, -0.340799120353511, 
    0.271892089522186, 0.236064375692484), `CD8_C08-SLC4A10` = c(1.40102283829925, 
    -0.158585496249154, -0.056110756095033, 0.00915832466806331, 
    -0.085141865592199, 3.78847417230501)), row.names = c(NA, 
    6L), class = "data.frame"))

CodePudding user response:

Here is a purrr and dplyr solution (df_list is the nested list, called df in the question):

library(tidyverse)

map2(df_list, names(df_list), 
     \(dat, name) {
       dat |>
         select(one = ...1, 
                two = ...2, 
                three = ...3, 
                all_of(name)) |>
         (\(d) filter(d, d[,4] > 3))() |>
         pull(two)
         }
       )
#> $`CD8_C01-LEF1`
#> [1] "CCR7" "LEF1" "SELL"
#> 
#> $`CD8_C02-GPR183`
#> [1] "IL7R"  "S1PR1" "SORL1"
#> 
#> $`CD8_C03-CX3CR1`
#> [1] "PLEK"   "CX3CR1" "FGFBP2" "FCGR3A" "FCRL6"  "KLRG1"

EDIT: explanation

map2 = I use this because you have a list of data frames, and the map functions work well with lists. I use the "2" variant because you also want to select a column based on the name of each list element.

\(dat, name) = creates an anonymous function that takes the two inputs from map2: dat is the data frame and name is the name of that list element.

select(one = ...1, two = ...2, three = ...3, all_of(name)) = here I select and rename the first three columns as requested in the question, and I also select the column whose name matches the list element's name with all_of(name). Remember that name is the argument of the anonymous function that holds the name of the list element.

(\(d) filter(d, d[,4] > 3))() = This syntax is a little funky because I like to use the native pipe operator (|>) rather than the magrittr pipe operator (%>%). It creates another anonymous function (\(d)) that names the current data d, then filters d on the 4th column being greater than 3 (i.e., d[,4] > 3). If you use the magrittr pipe, this can be simplified to filter(.[,4] > 3). Even better would be to use non-standard evaluation to avoid the anonymous function altogether, but I have trouble figuring out the proper use of {{}}, quo, enquo, and !! with quoted column names.

pull(two) = lastly, we extract only the values from the column called two, as a plain vector.
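To make the steps above concrete, here is a minimal sketch that runs them one at a time on a single toy data frame. The data frame `toy`, its column names (`V1`, `V2`, `V3`, `grp-A`, `grp-B`), and the intermediate names `step1` to `step3` are made up for illustration and are not from the question; `name` stands in for the list element's name that map2 would pass.

```r
library(dplyr)

# Hypothetical toy data frame standing in for one element of the list
toy <- data.frame(V1 = c("10", "20", "30"),
                  V2 = c("G1", "G2", "G3"),
                  V3 = c(1.5, 2.5, 3.5),
                  `grp-A` = c(4, 1, 6),
                  `grp-B` = c(0, 7, 0),
                  check.names = FALSE)  # keep the hyphenated column names
name <- "grp-A"  # plays the role of the list element's name

# Step 1: rename the first three columns and keep the column named in `name`
step1 <- toy |> select(one = V1, two = V2, three = V3, all_of(name))

# Step 2: keep rows whose fourth column exceeds 3 (native pipe + lambda)
step2 <- step1 |> (\(d) filter(d, d[, 4] > 3))()

# The same filter written with the magrittr pipe and its `.` placeholder
step2_magrittr <- step1 %>% filter(.[, 4] > 3)

# Step 3: extract the surviving values of column `two`
step3 <- pull(step2, two)
step3
#> [1] "G1" "G3"
```

Both pipe styles should keep the same two rows here; the only difference is how the current data is referred to inside filter().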

EDIT 2: cleaned-up code.

I figured out the non-standard evaluation needed to clean up the weird syntax.

map2(df_list, names(df_list), 
     \(dat, name) {
       dat |>
         select(one = ...1, 
                two = ...2, 
                three = ...3, 
                all_of(name)) |>
         filter(!!sym(name) > 3) |>  # all_of() is only meant for selection contexts such as select()
         pull(two)
         }
       )
#> $`CD8_C01-LEF1`
#> [1] "CCR7" "LEF1" "SELL"
#> 
#> $`CD8_C02-GPR183`
#> [1] "IL7R"  "S1PR1" "SORL1"
#> 
#> $`CD8_C03-CX3CR1`
#> [1] "PLEK"   "CX3CR1" "FGFBP2" "FCGR3A" "FCRL6"  "KLRG1"
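Another way to filter on a column whose name is held in a string is dplyr's `.data` pronoun, which avoids `!!` and `sym()` altogether. The sketch below is only an illustration under assumptions: the helper `make_df`, the list `toy_list`, and the column names are hypothetical, and `imap()` is swapped in for `map2(x, names(x), ...)` because it passes each element together with its name.

```r
library(dplyr)
library(purrr)

# Hypothetical helper that builds a small data frame with hyphenated columns
make_df <- function(genes, a, b) {
  data.frame(V1 = seq_along(genes), V2 = genes, V3 = 0,
             `grp-A` = a, `grp-B` = b, check.names = FALSE)
}

# Toy nested list: each element is named after one of its own columns
toy_list <- list(
  `grp-A` = make_df(c("G1", "G2", "G3"), a = c(5, 1, 4), b = c(0, 0, 0)),
  `grp-B` = make_df(c("G4", "G5"), a = c(9, 9), b = c(2, 6))
)

# imap() hands each data frame and its name to the function;
# .data[[name]] looks up the column whose name is stored in `name`
res <- imap(toy_list, \(dat, name) {
  dat |>
    filter(.data[[name]] > 3) |>
    pull(V2)
})
res
#> $`grp-A`
#> [1] "G1" "G3"
#>
#> $`grp-B`
#> [1] "G5"
```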

CodePudding user response:

A base R solution would be:

res <- lapply(setNames(nm = names(df)), function(dfname) {
  dff <- df[[dfname]]

  # only renaming column 2 as columns 1 and 3 are not used later on
  colnames(dff)[2] <- "two" 

  # not 'keeping' the column with the same name as the dataframe, just using the dataframe straightaway   
  dff$two[dff[,dfname] > 3]
})

Note the setNames(...) call as the first argument to lapply. setNames(nm = names(df)) builds a character vector in which each element is named after itself, and when lapply receives a named vector or list, it reuses those names for the elements of the result it returns.
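A quick way to see the effect of setNames(nm = ...) is to compare lapply over a plain character vector with lapply over the self-named version. The vector `ids` and the use of toupper are placeholders chosen for this sketch only.

```r
ids <- c("x", "y")  # hypothetical element names

# Plain character vector: the resulting list has no names
unnamed <- lapply(ids, toupper)

# setNames(nm = ids) names each element after itself,
# and lapply() carries those names over to its result
named <- lapply(setNames(nm = ids), toupper)

names(unnamed)
#> NULL
names(named)
#> [1] "x" "y"
```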
