Home > Net >  Loading CSV in Julia fails
Loading CSV in Julia fails

Time:08-24

I am trying to read a large DataFrame in Julia with CSV.read(file, DataFrame) and receive the following error:

ERROR: TaskFailedException

    nested task error: BoundsError: attempt to access 308-element Vector{BigFloat} at index [0]
    Stacktrace:
      [1] getindex
        @ .\array.jl:861 [inlined]
      [2] _scale(#unused#::Type{Float64}, v::BigInt, exp::Int64, neg::Bool)
        @ Parsers C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:524
      [3] scale(#unused#::Type{Float64}, v::BigInt, exp::Int64, neg::Bool)
        @ Parsers C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:408
      [4] parseexp
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:356 [inlined]
      [5] parsefrac
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:320 [inlined]
      [6] parsedigits
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:251 [inlined]
      [7] _parsedigits(#unused#::Type{Float64}, source::Vector{UInt8}, pos::Int64, len::Int64, b::UInt8, code::Int16, options::Parsers.Options, digits::BigInt, neg::Bool, startpos::Int64)
        @ Parsers C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:186
      [8] parsedigits
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:210 [inlined]
      [9] _parsedigits(#unused#::Type{Float64}, source::Vector{UInt8}, pos::Int64, len::Int64, b::UInt8, code::Int16, options::Parsers.Options, digits::UInt128, neg::Bool, startpos::Int64)
        @ Parsers C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:186
     [10] parsedigits
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:210 [inlined]
     [11] typeparser
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\floats.jl:179 [inlined]
     [12] xparse(::Type{Float64}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Float64})
        @ Parsers C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\Parsers.jl:316
     [13] xparse
        @ C:\Users\mazoi\.julia\packages\Parsers\KmPKe\src\Parsers.jl:266 [inlined]
     [14] detect
        @ C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\utils.jl:470 [inlined]
     [15] detect
        @ C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\utils.jl:459 [inlined]
     [16] findchunkrowstart(ranges::Vector{Int64}, i::Int64, buf::Vector{UInt8}, opts::Parsers.Options, typemap::Dict{Type, Type}, downcast::Bool, ncols::Int64, rows_to_check::Int64, columns::Vector{CSV.Column}, origcoltypes::Vector{Type}, columnlock::ReentrantLock, stringtype::Any, totalbytes::Base.Threads.Atomic{Int64}, totalrows::Base.Threads.Atomic{Int64}, succeeded::Base.Threads.Atomic{Bool})
        @ CSV C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\detection.jl:383
     [17] macro expansion
        @ C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\detection.jl:470 [inlined]
     [18] (::CSV.var"#16#17"{Vector{UInt8}, Parsers.Options, Vector{Int64}, Int64, Vector{CSV.Column}, DataType, Dict{Type, Type}, Bool, Int64, Vector{Type}, ReentrantLock, Base.Threads.Atomic{Bool}, Base.Threads.Atomic{Int64}, Base.Threads.Atomic{Int64}, Int64})()
        @ CSV .\threadingconstructs.jl:178
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:381
 [2] macro expansion
   @ .\task.jl:400 [inlined]
 [3] findrowstarts!(buf::Vector{UInt8}, opts::Parsers.Options, ranges::Vector{Int64}, ncols::Int64, columns::Vector{CSV.Column}, stringtype::Any, typemap::Dict{Type, Type}, downcast::Bool, rows_to_check::Int64)
   @ CSV C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\detection.jl:468
 [4] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
   @ CSV C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\context.jl:608
 [5] #File#25
   @ C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\file.jl:221 [inlined]
 [6] CSV.File(source::String)
   @ CSV C:\Users\mazoi\.julia\packages\CSV\jFiCn\src\file.jl:221
 [7] top-level scope
   @ REPL[16]:1

I think I can pinpoint the error to some rows that have strange mismatching types. I can read the data in Python and the problematic columns show \\N.

I tried to get around the issue by specifying escapechar='\\' but it didn't help. Any ways to get around these problematic rows or find a more robust package than CSV.jl? Thanks!

Update: added full Stacktrace of the error.

CodePudding user response:

If your file has fields that start with several digits (eg. SHA digest values), this may be due to this issue. If so, a fix has been added to CSV.jl, but isn't available in a release yet.

In any case, the error occurs during parsing done for multithreaded processing of the file, so you can avoid it by passing ntasks = 1 to the CSV.read call.

  • Related