I am trying to download single-cell matrices from a Gene Expression Omnibus data set, but every file has a unique URL. I wrote a loop that tries combinations of numbers in the URL until the file is found. The addresses differ by one number, so I wrote this:
file_number <- as.character(29:38) # SRA file numbers
for (i in 1:10)
{
  data <- tryCatch(
    lapply(file_number[1], function(x) {
      base_url <- paste0(
        'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4715nnn/GSM47154', x,
        '/suppl/GSM47154', x,'_P', i,
        '.expression_matrix.txt.gz')
      # Download data to temporary directory
      temp <- tempfile()
      download.file(base_url, temp)
      gzfile(temp, 'rt')
      counts <- read.table(file = temp, row.names = 1)
      unlink(temp)}),
    error = function(e) { skip_to_next <<- TRUE})
  if(skip_to_next) { next }
  else print(data)
}
The code runs, but it does not return the data frame from the correct link, which should be ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4715nnn/GSM4715429/suppl/GSM4715429_P1.expression_matrix.txt.gz
CodePudding user response:
Since you are running nested loops and need to return data, consider nested lapply calls without any skip instructions. Simply add return lines, and even print the error for the corresponding URL attempt:
file_numbers <- as.character(29:38) # SRA file numbers
df_list <- lapply(1:10, function(i) {
  lapply(file_numbers, function(x) {
    # Build URL
    base_url <- paste0(
      'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4715nnn/GSM47154', x,
      '/suppl/GSM47154', x, '_P', i,
      '.expression_matrix.txt.gz'
    )
    # Download data to a temporary file
    temp <- tempfile()
    tryCatch({
      print(x)
      download.file(base_url, temp)
      # Read in data (read.table decompresses gzip files itself,
      # so no separate gzfile() connection is needed)
      counts <- read.table(file = temp, row.names = 1)
      return(counts) # RETURN DATA ON SUCCESS
    },
    error = function(e) {
      print(e) # OUTPUT MESSAGE TO CONSOLE
      return(NULL) # RETURN NULL ON ERROR
    },
    finally = unlink(temp)) # DELETE TEMP FILE ON SUCCESS OR ERROR
  })
})
# REMOVE NULL ELEMENTS
df_list <- lapply(df_list, function(dfs) Filter(NROW, dfs))
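The return-NULL-then-Filter pattern can be tried without any network access. The following is only a minimal sketch: read_or_null is a hypothetical helper name, and the two temporary paths stand in for one valid and one invalid URL.

```r
# Hypothetical helper: read a table, or return NULL if anything fails.
read_or_null <- function(path) {
  tryCatch(
    read.table(file = path, row.names = 1),
    error = function(e) {
      message("Failed: ", path, " (", conditionMessage(e), ")")
      NULL
    }
  )
}

# Write one small gzip-compressed table to a temporary file.
good <- tempfile(fileext = ".txt.gz")
con <- gzfile(good, "wt")
write.table(data.frame(c1 = 1:2, c2 = 3:4, row.names = c("g1", "g2")), con)
close(con)

# A path that does not exist, standing in for a wrong URL.
bad <- tempfile()

results <- lapply(c(good, bad), read_or_null)

# NROW(NULL) is 0, which Filter treats as FALSE, so NULL entries are dropped.
kept <- Filter(NROW, results)

unlink(good)
```

Here results has two elements (a data frame and NULL), while kept holds only the data frame.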
Output of df_list (first three elements at each level of the nested list):
Before Filter
lapply(df_list, \(x) lapply(x, \(d) d[1:5, 1:5]))
[[1]]
[[1]][[1]]
GTTCTTAATCTG AGCCTCAACGCC CTGTCCTTCATG CCCCAGACGCTA CGCCCCAACTTA
ARPC2 72 88 62 78 90
RPS2 48 64 60 43 55
SNX10 22 36 29 26 25
B2M 660 620 684 611 672
LYZ 242 232 264 273 193
[[1]][[2]]
NULL
[[1]][[3]]
NULL
[[2]]
[[2]][[1]]
NULL
[[2]][[2]]
CCGAGTCCCTGT CCCCCGGTGATG TTGCATATCACT TCGTCGGAAGGG CTATCCCGGGCT
KIF22 0 1 0 0 3
B2M 389 69 468 260 142
ADD1 0 2 0 4 0
IFITM3 6 22 13 5 14
SUPT16H 0 2 0 5 2
[[2]][[3]]
NULL
[[3]]
[[3]][[1]]
NULL
[[3]][[2]]
NULL
[[3]][[3]]
AGATCATACCTT GCGAAGTCGCGA CGGGCGGCCGCA TTACCATGTCTT AAGTCAGGCGTC
SDAD1 1 4 4 1 4
COX7C 54 36 35 26 31
RPS24 76 74 79 32 51
ACIN1 10 4 5 4 3
GLO1 6 2 3 5 4
After Filter
lapply(df_list, \(x) lapply(x, \(d) d[1:5, 1:5]))
[[1]]
[[1]][[1]]
GTTCTTAATCTG AGCCTCAACGCC CTGTCCTTCATG CCCCAGACGCTA CGCCCCAACTTA
ARPC2 72 88 62 78 90
RPS2 48 64 60 43 55
SNX10 22 36 29 26 25
B2M 660 620 684 611 672
LYZ 242 232 264 273 193
[[2]]
[[2]][[1]]
CCGAGTCCCTGT CCCCCGGTGATG TTGCATATCACT TCGTCGGAAGGG CTATCCCGGGCT
KIF22 0 1 0 0 3
B2M 389 69 468 260 142
ADD1 0 2 0 4 0
IFITM3 6 22 13 5 14
SUPT16H 0 2 0 5 2
[[3]]
[[3]][[1]]
AGATCATACCTT GCGAAGTCGCGA CGGGCGGCCGCA TTACCATGTCTT AAGTCAGGCGTC
SDAD1 1 4 4 1 4
COX7C 54 36 35 26 31
RPS24 76 74 79 32 51
ACIN1 10 4 5 4 3
GLO1 6 2 3 5 4
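In the output above, the successful downloads fall on the diagonal: sample 29 with P1, 30 with P2, 31 with P3. If that one-to-one pairing holds for all ten samples (an assumption, only the first three are shown here), the URLs could be built directly instead of trying all 100 combinations. A rough sketch, where p_ids is an illustrative name for the number after _P:

```r
file_numbers <- 29:38  # last two digits of the GSM accession
p_ids <- 1:10          # assumed to pair one-to-one with file_numbers

# sprintf is vectorized, so this builds one URL per (file number, P id) pair.
urls <- sprintf(
  paste0(
    "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4715nnn/GSM47154%d",
    "/suppl/GSM47154%d_P%d.expression_matrix.txt.gz"
  ),
  file_numbers, file_numbers, p_ids
)

urls[1]
```

The first element matches the link given in the question. Each URL would still need the same tryCatch handling when downloaded, since the pairing is not guaranteed.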