How to convert a large .csv file with "too many columns" into an SQL database


I was given a large .csv file (around 6.5 GB) with 25k rows and 20k columns. Let's call the first column ID1; each additional column then holds the value for each of these ID1s under a different condition. Let's call these conditions ID2s.

This is the first time I have worked with files this large. I want to process the .csv file in R and summarize the values (mean, standard deviation and coefficient of variation) for each ID1.

My idea was to read the file directly (with data.table's fread), convert it into "long" format (with dplyr/tidyr) so that I have three columns, ID1, ID2 and value, then group by ID1 and ID2 and summarize. However, I do not seem to have enough memory to read the file (I assume R needs more memory than the file's size on disk to hold it).

I think it would be more efficient to first convert the file into an SQL database and then process it from there. I tried converting it with sqlite3, but it gives me an error stating that the maximum number of columns to read is 4096.

I have no experience with SQL, so I was wondering what the best way of converting the .csv file into a database would be. I guess reading each column and storing it as a table, or something like that, would work.

I have searched for similar questions, but most answers just say that having so many columns is bad database design. I cannot regenerate the .csv file with a proper structure.

Any suggestions for an efficient way of processing the .csv file?

Best,

CodePudding user response:

Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data cannot fit in RAM. The code below reads the text file in chunks, pivots the data into long format and appends it to the SQL database; you can then access the database with dplyr verbs for summarizing. It uses another example dataset (mtcars), because I have no idea which column types ID1 and ID2 have. For your file you might want to use pivot_longer(-ID1) so that the ID column is kept, giving you two identifier columns plus the value.

library(tidyverse)
library(DBI)
library(vroom)

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))

file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5

# start with offset 1 to ignore header
for(chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
  # everything must be character to allow pivoting numeric and  text columns
  vroom(file, skip = chunk_offset, n_max = chunk_size,
    col_names = FALSE, col_types = cols(.default = col_character())
  ) %>%
    pivot_longer(everything()) %>%
    dbAppendTable(conn, "data", value = .)
}

data <- conn %>% tbl("data") 
data
#> # Source:   table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#>    name  value    
#>    <chr> <chr>    
#>  1 X1    Mazda RX4
#>  2 X2    21       
#>  3 X3    6        
#>  4 X4    160      
#>  5 X5    110      
#>  6 X6    3.9      
#>  7 X7    2.62     
#>  8 X8    16.46    
#>  9 X9    0        
#> 10 X10   1        
#> # … with more rows
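
On the real data (roughly 25k rows × 20k columns, i.e. around 500 million rows once pivoted long), a filter on name like the one below has to scan the entire table unless it is indexed. An optional extra step, sketched here with an illustrative index name:

# speed up later filters/group-bys on the long table (index name is arbitrary)
DBI::dbExecute(conn, "CREATE INDEX IF NOT EXISTS idx_data_name ON data (name)")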

data %>%
  # summarise only the 3rd column
  filter(name == "X3") %>%
  group_by(value) %>%
  count() %>%
  arrange(-n) %>%
  collect()
#> # A tibble: 3 × 2
#>   value     n
#>   <chr> <int>
#> 1 8        14
#> 2 4        11
#> 3 6         7

Created on 2022-04-15 by the reprex package (v2.0.1)
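
For the summaries the question actually asks for (mean, standard deviation and coefficient of variation per ID1), here is a minimal sketch, assuming the long table was written with pivot_longer(-ID1) so that it has columns ID1, name and value (stored as text). Plain SQLite has AVG() but no built-in standard deviation, so the population variance is computed as E[x^2] - E[x]^2 inside the database and only the ~25k summary rows are pulled into R:

library(DBI)

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")

# values were stored as text, so cast to REAL before aggregating
res <- dbGetQuery(conn, "
  SELECT ID1,
         AVG(CAST(value AS REAL)) AS mean,
         AVG(CAST(value AS REAL) * CAST(value AS REAL))
           - AVG(CAST(value AS REAL)) * AVG(CAST(value AS REAL)) AS pop_var
  FROM data
  GROUP BY ID1
")

res$sd <- sqrt(res$pop_var)  # population sd; scale by n/(n-1) if you need the sample sd
res$cv <- res$sd / res$mean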

CodePudding user response:

An alternative is to bypass R entirely (for reshaping and storing the data) and pivot the CSV in bash itself.

First, I'll start with a known wide-format csv (quux.csv):

"ID","V1","V2","V3","V4","V5"
1,1,2,3,4,5
2,6,7,8,9,10
3,11,12,13,14,15
4,16,17,18,19,20

From here, a small hard-coded bash script preserves the first column and pivots the rest (the script is named question.bash):

#!/bin/bash
unset HDR
while IFS=, read -a FIELDS ; do
    if [[ -z "${HDR}" ]]; then
        echo "${FIELDS[0]},key,val"
        HDR=(${FIELDS[@]})
    else
        for i in `seq 1 $(( ${#FIELDS[@]} - 1 ))` ; do
            echo "${FIELDS[0]},${HDR[$i]},${FIELDS[$i]}"
        done
    fi
done

Running it and the results:

$ ./question.bash < quux.csv > quux-wide.csv
$ cat quux-wide.csv
"ID",key,val
1,"V1",1
1,"V2",2
1,"V3",3
1,"V4",4
1,"V5",5
2,"V1",6
2,"V2",7
2,"V3",8
2,"V4",9
2,"V5",10
3,"V1",11
3,"V2",12
3,"V3",13
3,"V4",14
3,"V5",15
4,"V1",16
4,"V2",17
4,"V3",18
4,"V4",19
4,"V5",20

And import into a sqlite3 database:

$ sqlite3 quux.sqlite '.mode csv' '.import quux-wide.csv newtable'

Confirm that it is there:

$ sqlite3 quux.sqlite
sqlite> .headers on
sqlite> select * from newtable limit 3;
ID,key,val
1,V1,1
1,V2,2
1,V3,3

This can then be accessed in R using DBI and RSQLite:

con <- DBI::dbConnect(RSQLite::SQLite(), "quux.sqlite")
DBI::dbGetQuery(con, "select * from newtable limit 3")
#   ID key val
# 1  1  V1   1
# 2  1  V2   2
# 3  1  V3   3
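
From here the same lazy dplyr/dbplyr workflow as in the first answer applies. A sketch of the per-ID summaries, assuming the columns are named ID, key and val as above (val is imported as text, so it is cast to numeric; the variance is again derived from E[x^2] - E[x]^2 after collecting the small summary table):

library(dplyr)
library(dbplyr)  # needed for tbl() on a DBI connection

tbl(con, "newtable") %>%
  group_by(ID) %>%
  summarise(mean = mean(as.numeric(val), na.rm = TRUE),
            msq  = mean(as.numeric(val) * as.numeric(val), na.rm = TRUE)) %>%
  collect() %>%
  mutate(sd = sqrt(msq - mean^2),  # population sd
         cv = sd / mean)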

Note: this is not fast, but it works. On a file this size the bash script alone may take many minutes, if not an hour or two, to pivot the data, so it may be more of a curiosity than a practical tool. However, if nothing else works, it is a reasonable fallback.
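
If the pure-bash loop is too slow at this scale, the same hard-coded pivot can be expressed in awk, which typically processes large text files much faster; this is a sketch under the same assumption that no field contains an embedded comma:

$ awk -F, 'NR == 1 { split($0, hdr, ","); print $1 ",key,val"; next }
           { for (i = 2; i <= NF; i++) print $1 "," hdr[i] "," $i }' quux.csv > quux-wide.csv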
