Home > Software engineering >  How can I read files to DataFrame in Julia with CSV.read when delimiter might be missing?
How can I read files to DataFrame in Julia with CSV.read when delimiter might be missing?

Time:06-20

I have a bunch of files (*.pdb files generated by AlphaFold, for those who work in bio/life science) that I want to read into DataFrames in Julia.

They are typically a few 1000 lines of something like this:

MODEL     1                                                                     
ATOM      1  N   MET B   1      21.976 -98.993 -39.513  1.00 27.24           N  
ATOM      2  CA  MET B   1      23.354 -99.175 -39.063  1.00 27.24           C  
ATOM      3  C   MET B   1      23.767 -98.056 -38.113  1.00 27.24           C  
ATOM      4  CB  MET B   1      24.308 -99.226 -40.258  1.00 27.24           C  
ATOM      5  O   MET B   1      23.315 -96.918 -38.253  1.00 27.24           O  
ATOM      6  CG  MET B   1      24.512-100.624 -40.820  1.00 27.24           C  

I read them to DataFrames with CSV.read using whitespace delimiter and ignorerepeated;

headers = ["Type", "Index", "Atom", "Amino acid type", "Chain", "Amino acid number", "Position X", "Position Y", "Position Z", "Something", "pIDDT", "Atom type"]
types = Dict(:"Type"=>String, :"Index"=>Int64, :"Atom"=>String, :"Amino acid type"=>String, :"Chain"=>Char, :"Amino acid number"=>Int64, :"Position X"=>Float64, :"Position Y"=>Float64, :"Position Z"=>Float64, :"Something"=>Float64, :"pIDDT"=>Float64, :"Atom type"=>Char)
df = CSV.read(file, DataFrame; header=headers, skipto=2, delim=' ', ignorerepeated=true, types=types)

The problem is with some rows, where a whitespace is "missing". In the last line in the example file above, there is no whitespace between column 7 and 8, since the value i column 8 (-100.624) takes up the space in front of it.

This results in something like the below, where the row (now row 6) is offset with some data missing:

Row │ Type    Index   Atom     Amino acid type  Chain  Amino acid number  Position X   Position Y  Position Z  Something  pIDDT       Atom type 
     │ String  Int64?  String?  String?          Char?  Int64?             Float64?     Float64?    Float64?    Float64?   Float64?    Char?     
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ ATOM         1  N        MET              B                      1       21.976     -98.993     -39.513       1.0        27.24  N
   2 │ ATOM         2  CA       MET              B                      1       23.354     -99.175     -39.063       1.0        27.24  C
   3 │ ATOM         3  C        MET              B                      1       23.767     -98.056     -38.113       1.0        27.24  C
   4 │ ATOM         4  CB       MET              B                      1       24.308     -99.226     -40.258       1.0        27.24  C
   5 │ ATOM         5  O        MET              B                      1       23.315     -96.918     -38.253       1.0        27.24  O
   6 │ ATOM         6  CG       MET              B                      1  missing         -40.82        1.0        27.24  missing     missing   

I was thinking of pre-formatting the file (line for line, if there's a - with no whitespace in front, add whitespace), but is there a better way?

CodePudding user response:

With CSV.jl I think there is no better solution than pre-formatting. However, since your file is small you probably can just do the following (so that pre-formatting is done as a pre-processing step in RAM):

io = replace(read(file, String), r"(\d)-" => s"\1 -") |> IOBuffer
df = CSV.read(io, DataFrame; header=headers, skipto=2, delim=' ', ignorerepeated=true, types=types)
  • Related