I have a million entries in a CSV file that I need to load, but loading takes around two minutes to complete. I need to make the data load faster. Is there a better way to do this? Any pointers I can research and try would be appreciated. Thank you for any help.
CSVReader.h
#pragma once
#include "OrderBookEntry.h"
#include <vector>
#include <string>

class CSVReader
{
public:
    CSVReader();
    static std::vector<OrderBookEntry> readCSV(std::string csvFile);
    // helpers used by readCSV, defined in CSVReader.cpp
    static std::vector<std::string> tokenise(std::string csvLine, char separator);
    static OrderBookEntry stringsToOBE(std::vector<std::string> tokens);
};
CSVReader.cpp
#include "CSVReader.h"
#include <iostream>
#include <fstream>
CSVReader::CSVReader() {
}
std::vector<OrderBookEntry> CSVReader::readCSV(std::string csvFilename) {
    std::vector<OrderBookEntry> entries;
    std::ifstream csvFile{csvFilename};
    std::string line;
    if (csvFile.is_open())
    {
        while (std::getline(csvFile, line))
        {
            try {
                OrderBookEntry obe = stringsToOBE(tokenise(line, ','));
                entries.push_back(obe);
            }
            catch (const std::exception& e) {
                std::cout << "CSVReader::readCSV bad data" << std::endl;
            }
        } // end of while
    }
    std::cout << "Successfully read " << entries.size() << " entries" << std::endl;
    return entries;
}
std::vector<std::string> CSVReader::tokenise(std::string csvLine, char separator) {
    std::vector<std::string> tokens;
    signed int start, end;
    std::string token;
    start = csvLine.find_first_not_of(separator, 0);
    do {
        end = csvLine.find_first_of(separator, start);
        if (start == csvLine.length() || start == end) break;
        if (end >= 0) token = csvLine.substr(start, end - start);
        else token = csvLine.substr(start, csvLine.length() - start);
        tokens.push_back(token);
        start = end + 1;
    } while (end > 0);
    return tokens;
}
OrderBookEntry CSVReader::stringsToOBE(std::vector<std::string> tokens) {
    double price, amount;
    if (tokens.size() != 5) {
        std::cout << "Bad Input" << std::endl;
        throw std::exception{};
    }
    try {
        // we have 5 tokens
        price = std::stod(tokens[3]);
        amount = std::stod(tokens[4]);
    }
    catch (const std::exception& e) {
        std::cout << "Bad Float! " << tokens[3] << std::endl;
        std::cout << "Bad Float! " << tokens[4] << std::endl;
        throw;
    }
    OrderBookEntry obe{price, amount, tokens[0], tokens[1],
                       OrderBookEntry::stringToOrderBookType(tokens[2])};
    return obe;
}
CodePudding user response:
As mentioned in the comments, the first thing that should be optimized is the passing of arguments by value. I.e., the parameters should be const std::string& arg_name instead of std::string arg_name, etc.
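For example, applied to the declarations from the question (a minimal sketch; only the parameter types change):

static std::vector<OrderBookEntry> readCSV(const std::string& csvFilename);
static std::vector<std::string> tokenise(const std::string& csvLine, char separator);
static OrderBookEntry stringsToOBE(const std::vector<std::string>& tokens);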
Then, the use of std::vector is not optimal either, because of the vector's nature: it stores its elements contiguously. That means when there is no longer room to append the next element contiguously, all existing elements must be reallocated to a new place. So, if a vector is not critically needed, consider substituting it with std::list or something else.
Then, I'd simplify the CSVReader::tokenise method to something like this:
std::list<std::string> CSVReader::tokenise(const std::string& str, char sep)
{
    std::list<std::string> result;
    std::string::size_type cur = 0;
    for (auto pos = str.find_first_of(sep); pos != std::string::npos;
         cur = pos + 1, pos = str.find_first_of(sep, cur))
        result.emplace_back(str, cur, pos - cur);
    result.emplace_back(str, cur); // last token, after the final separator
    return result;
}
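For example, a hypothetical call on one line in the question's five-field format (assuming the header declaration is updated to match this signature):

// "2020/03/17 17:01:24,ETH/BTC,bid,0.02,7.44"
// -> {"2020/03/17 17:01:24", "ETH/BTC", "bid", "0.02", "7.44"}
std::list<std::string> tokens = CSVReader::tokenise(line, ',');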
Then, I'd combine the tokenise and stringsToOBE methods, as sketched below.
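A minimal sketch of such a combination (the name parseLine is mine; the OrderBookEntry constructor and stringToOrderBookType are taken from the question):

OrderBookEntry CSVReader::parseLine(const std::string& line, char sep)
{
    // Split into exactly 5 fields without building a container of tokens first.
    std::string fields[5];
    std::string::size_type cur = 0;
    for (int i = 0; i < 5; ++i)
    {
        auto pos = line.find_first_of(sep, cur);
        // The last field must run to the end of the line; earlier fields must not.
        if ((pos == std::string::npos) != (i == 4))
            throw std::runtime_error{"CSVReader::parseLine bad data"};
        fields[i] = line.substr(cur, pos - cur);
        cur = pos + 1;
    }
    return OrderBookEntry{std::stod(fields[3]), std::stod(fields[4]),
                          fields[0], fields[1],
                          OrderBookEntry::stringToOrderBookType(fields[2])};
}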
In general, review the code which is executed in the loop.
CodePudding user response:
Think about what sort of operations you are doing while parsing a single line and where the inefficiency comes from.
For instance,
std::vector<std::string> CSVReader::tokenise(std::string csvLine, char separator)
accepts csvLine by copy. Copying a string is a potentially slow operation, as it might involve a memory allocation.
Then you create a vector of strings from a string... wouldn't it be much better to create a vector of std::string_view, so that you don't copy each element of the string? To make the function safer, require tokenise to accept a std::string_view as well, rather than std::string or const std::string&.
Next, by reviewing CSVReader::stringsToOBE(std::vector<std::string> tokens), I see that you actually require the size to be 5, then parse some elements as floating point, copy some as strings, and convert another to an enum. Wouldn't it make more sense to require the input to be std::array<std::string_view, 5>, enforcing the size of 5 at compile time rather than testing it at runtime? To adapt to this change you can make tokenise a template that accepts a number N as a template parameter and converts a std::string_view into a std::array<std::string_view, N> based on the separator, verifying at the end that the whole string view was consumed; see the sketch below. This saves another dynamic allocation by eliminating the unnecessary std::vector.
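A minimal sketch of that template (the exact error handling is my assumption):

#include <array>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Splits `line` into exactly N fields separated by `sep`; throws on a field-count mismatch.
template <std::size_t N>
std::array<std::string_view, N> tokenise(std::string_view line, char sep)
{
    std::array<std::string_view, N> out;
    std::size_t cur = 0;
    for (std::size_t i = 0; i < N; ++i)
    {
        const std::size_t pos = line.find(sep, cur);
        // The last field must run to the end of the line; earlier fields must not.
        if ((pos == std::string_view::npos) != (i == N - 1))
            throw std::runtime_error{"unexpected number of fields"};
        out[i] = line.substr(cur, pos - cur);
        cur = pos + 1;
    }
    return out;
}

Usage would then look like auto fields = tokenise<5>(line, ','); and since the size is known at compile time, stringsToOBE no longer needs its runtime size check.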
But why do you get a string from the ifstream and then parse it, converting each element separately? Why not extract the elements from the stream directly, without creating an intermediate string? Say, add a function OrderBookEntry CSVReader::extractOBEElement(std::istream& stream) and just use the stream's native extraction operator >>. It might work and be more efficient, but streams are known to be slow, so generating a string first might indeed be preferable.
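A rough sketch of such a function (the field order follows the constructor call in the question; the comma and newline handling here is my assumption):

#include <istream>
#include <limits>
#include <stdexcept>
#include <string>

OrderBookEntry CSVReader::extractOBEElement(std::istream& stream)
{
    std::string timestamp, product, typeStr;
    double price = 0.0, amount = 0.0;
    std::getline(stream, timestamp, ',');  // field 1: timestamp
    std::getline(stream, product, ',');    // field 2: product
    std::getline(stream, typeStr, ',');    // field 3: order type
    stream >> price;                       // field 4: stops at the comma
    stream.ignore(1);                      // skip the comma
    stream >> amount;                      // field 5
    stream.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // consume rest of line
    if (!stream)
        throw std::runtime_error{"CSVReader::extractOBEElement bad data"};
    return OrderBookEntry{price, amount, timestamp, product,
                          OrderBookEntry::stringToOrderBookType(typeStr)};
}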
CodePudding user response:
(1) The CSV format is a bit more complex than you assumed (e.g., quoted fields may contain the separator). For educational purposes you can ignore this, but you should be aware of it.
(2) If you read numbers, you can read directly from the stream; parsing will stop after reading the number:
if (!(is_ >> obe.price))
    throw runtime_error("could not read price");
ExpectComma();
(3) You can use getline to read a string up to an arbitrary delimiter, not necessarily a newline:
if (!getline(is_, obe.title, ','))
    throw runtime_error("could not read title");
if (!getline(is_, obe.isbn, '\n'))
    throw runtime_error("could not read isbn");
(4) The code after a thrown exception is not executed. So, if an exception is thrown while you are trying to read price, then when you catch it you cannot tell anything about amount.
(5) A simple implementation takes ~4 seconds to read a 2’000’000-line file (~100MB) on my computer (Xeon E3-1241 v3 @ 3.50GHz/32GB/non-SSD) in Release configuration, and ~40 seconds in Debug configuration.
(6) Using reserve, as some have recommended, means you know enough about the file you are about to load, which is rarely true. The quick implementation I did uses no buffers, nor does it reserve memory.
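That said, if you do want to pre-allocate, one option is to estimate the entry count from the file size (a sketch; the average bytes-per-line figure is an assumption you would tune to your data):

#include <cstdint>
#include <filesystem>
#include <vector>

const std::uintmax_t bytes = std::filesystem::file_size(csvFilename);
std::vector<OrderBookEntry> entries;
// Assume ~50 bytes per CSV line; an imprecise guess only affects how often the vector reallocates.
entries.reserve(bytes / 50);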
, as some recommended, means you know enough about the file you are about to load, which is rarely true. The quick implementation I did does not use buffers nor it reserves memory.