Find X-largest values in a large file with optional input file command-line parsing in C++


I have a file in the following fixed format:

<unique record identifier> <white_space> <numeric value>

e.g.

1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
.
.
.

I want to write a program that reads from 'stdin' the contents of a file, and optionally accepts the absolute path of a file from the command line. The file/stdin stream is expected to be in the above format. The output should be a list of the unique ids associated with the X-largest values in the rightmost column, where X is specified by an input parameter.

For example, given the input data above and X=3, the following would be valid output:

1426828028
1426828066
1426828056

Note that the output does not need to be in any particular order. Multiple instances of the same numeric value count as distinct records of the total X. So if we have 4 records with values: 200, 200, 115, 110 and X=2 then the result must consist of the two IDs that point to 200 and 200 and no more.

Notice: take into account extremely large files.

My idea and brief implementation, sorting by the k largest values:

  • 1st way: read the file content into a multimap, then iterate over the k largest elements for output.
  • 2nd way: read the file data into a vector<pair<int, int>>, then use heap sort (a priority queue).

I'm wondering which way has better time complexity and higher performance. The time complexity of the 2nd way should be n log(n). Is the time complexity of the 1st way log(n)? Please tell me the time and space complexity of both methods and suggest any better ones.
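
To make the 1st way concrete, here is a rough sketch of what I have in mind (printTopK and the variable names are just for illustration, not tested code):

#include <iostream>
#include <map>
#include <string>

// 1st way: read every record into a value-keyed multimap (kept sorted by value),
// then walk the k largest entries and print their ids.
void printTopK(std::istream& in, std::size_t k) {
    std::multimap<int, std::string> byValue;   // key = numeric value, mapped = record id
    std::string id;
    int value;
    while (in >> id >> value)
        byValue.emplace(value, id);            // O(log n) per insert, whole map stays in memory

    std::size_t printed = 0;
    for (auto it = byValue.rbegin(); it != byValue.rend() && printed < k; ++it, ++printed)
        std::cout << it->second << '\n';       // ids of the largest values, largest first
}

int main() {
    printTopK(std::cin, 3);
}

The insertions alone already cost n log(n), and the whole map has to fit in memory, which is exactly what worries me for huge files.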

Besides, the input file is huge, so I am thinking of using an external sort, but I haven't done one before. I'd appreciate it if someone could explain it or provide sample code to help me understand.

Anyway, the output does not have to be sorted; we only need to print the X largest values in any order. So I'm wondering whether I need any sorting algorithm at all. The requirement to print the X largest values "in any order" seems strange to me, because it feels like we must sort them in descending order before printing anyway, so I don't see how "in any order" makes the problem easier.

My brief code:

#include <iostream>
#include <fstream>
#include <string>
#include <utility>
#include <vector>
#include <map>
#include <algorithm>
#include <cassert>
//#include "stdafx.h"

using namespace std;

std::multimap<int, int> mp;

typedef std::pair<std::string, int> mypair;

struct IntCmp {
    bool operator()(const mypair &lhs, const mypair &rhs) {
        return lhs.second < rhs.second;
    }
};

void printK(const std::map<std::string,int> &mymap, int k) {
    std::vector<mypair> myvec(mymap.begin(), mymap.end());
    assert(myvec.size() >= k);
    std::partial_sort(myvec.begin(), myvec.begin() + k, myvec.end(), IntCmp());

    for (int i = 0; i < k; ++i) {
        std::cout << i << ": " << myvec[i].first 
            << "-> " << myvec[i].second << "\n";
    }
}

void readinfo(std::istream& in)
{
    std::string s, ID, Value;
    //while (getline(in, s))
    while (in >> ID >> Value)
        std::cout << ID << ' ' << Value << '\n';
}

int main (int argc, char **argv) {

    if (argc > 1) {     /* read from file if given as argument */
        std::ifstream fin (argv[1]);
        if (fin.is_open())
            readinfo(fin);
        else 
        {
            std::cerr << "error: Unable to open file " << argv[1] << '\n';
            return 1;
        }
    }
    else 
        // No input file has been passed in the command line.
        // Read the data from stdin (std::cin).
    {
        readinfo(std::cin);
    }

    return 0;
}

But I don't know how to split the huge file, sort the chunks, and merge them back together. Please tell me how to fix my code for this problem.

CodePudding user response:

Maybe you could use a min-heap via std::priority_queue, which keeps at most k entries in memory (O(k) extra space) and processes n records in O(n log k) time:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct IdAndValue {
  std::string id;
  int value;
};

struct ValueCmp {
  bool operator()(const IdAndValue &lhs, const IdAndValue &rhs) {
    return lhs.value > rhs.value;
  }
};

void PrintTopK(std::istream &in, long k) {
  std::priority_queue<IdAndValue, std::vector<IdAndValue>, ValueCmp> largest_k;
  std::string id;
  int value;
  while (in >> id >> value) {
    if (largest_k.size() < k) {
      largest_k.push({.id = id, .value = value});
    } else {
      if (value > largest_k.top().value) {
        largest_k.pop();
        largest_k.push({.id = id, .value = value});
      }
    }
  }
  std::cout << "Top " << k << " IDs with largest values:\n";
  while (!largest_k.empty()) {
    IdAndValue id_and_value = largest_k.top();
    largest_k.pop();
    std::cout << id_and_value.id << '\n';
  }
}

int main(int argc, char **argv) {
  if (argc > 2) {  // Read from file if given as argument.
    std::ifstream fin(argv[1]);
    if (fin.is_open())
      PrintTopK(fin, std::strtol(argv[2], nullptr, 10));
    else {
      std::cerr << "Error: Unable to open file " << argv[1] << '\n';
      return 1;
    }
  } else if (argc == 2) {  // Only k given: read the data from stdin (std::cin).
    PrintTopK(std::cin, std::strtol(argv[1], nullptr, 10));
  } else {  // No arguments at all: we need at least k.
    std::cerr << "Usage: " << argv[0] << " [file] k\n";
    return 1;
  }
  return 0;
}
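
Assuming the source is saved as PrintTopK.cpp (the answer does not name the file), it can be built with a C++20 compiler, for example with g++ -std=c++20 -O2 PrintTopK.cpp -o PrintTopK; the designated initializers ({.id = id, .value = value}) require C++20.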

Usage from stdin (Ctrl-D to send EOF on Unix):

./PrintTopK 3
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
Top 3 IDs with largest values:
1426828066
1426828056
1426828028

Usage when passed in a file:

$ ./PrintTopK /Users/shash/CLionProjects/PrintTopK/data.txt 3
Top 3 IDs with largest values:
1426828066
1426828056
1426828028

With data.txt:

1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111

CodePudding user response:

I think we can come up with a better approach that has a lower space and time complexity.

One requirement is to get the x largest values, so we only need to store x values, not more. The others are of no interest: every value is read and, if it is not larger than the already collected values, it is discarded. With that, we save a lot of memory.

Next, how to store?

If we keep the container sorted, the smallest element is always at the beginning. So, when we read a new value, we only need to compare it with the first element in the container: if the new value is smaller than the smallest stored value, we can discard it; if it is bigger, we add it to the container and eliminate the smallest element so far.

If we use a function like std::lower_bound, it gives us the exact position where the new element needs to be inserted, without any re-sorting: it can be inserted at exactly the right position, and then we have a new smallest value.

To select the type of container, we think of the operations that we need to do.

  • We want to eliminate the first element (without shifting all other data).
  • We want to add an element at a given position, without the need to shift all following values to the right.

This leads us to a std::list, which will fulfil our criteria in an optimal way.
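
To illustrate just that compare/insert/trim step on a std::list before looking at the full program, here is a minimal, self-contained sketch with plain ints instead of the records (for illustration only):

#include <algorithm>
#include <iostream>
#include <list>

int main() {
    std::list<int> topX{ 3, 7, 9, 12 };     // already sorted, smallest value at the front
    int newValue = 10;                      // bigger than the current smallest (3)

    // lower_bound returns the first position whose value is >= newValue,
    // so inserting there keeps the list sorted (end() simply appends at the back).
    auto pos = std::lower_bound(topX.begin(), topX.end(), newValue);
    topX.insert(pos, newValue);
    topX.pop_front();                       // drop the old smallest, the size stays at X

    for (int v : topX) std::cout << v << ' ';   // prints: 7 9 10 12
}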

So, how will we implement the solution?

  • Define a struct to hold the data: the unique id and the associated value
  • Add extraction >> and insertion << operators for easier IO
  • Add 2 sort operator overloads, for std::list::sort and std::lower_bound
  • In main, attach a std::istream either to a given source file or to std::cin
  • Read the first X values and store them in the list as is. If there are only these X values or fewer, then we already have the solution
  • Sort the values in the list. The smallest value is now at the front of the list
  • If the std::istream still contains data, continue reading values
  • If a new value is greater than the smallest value in the list, insert it into the list at its sorted position
  • Delete the smallest value at the front.

After the initial sort, each further element costs at most O(X) work (the list traversal in std::lower_bound plus the insertion), which is constant with respect to the number of input records, so the number of input values does not add complexity. Anyway, most of the time will be spent reading data from disk if the file is huge. For small files with 100'000 values or so, any other algorithm would also be fine.

Please see one of many potential solutions below.

#include <iostream>
#include <fstream>
#include <iterator>
#include <memory>
#include <random>
#include <string>
#include <list>
#include <limits>
#include <algorithm>

const std::string fileName{ "r:\\big.txt" };

// ----------------------------------------------------------------------------
// Create a big file for Test purposes
void createBigFile() {

    if (std::ofstream ofs(fileName); ofs) {

        constexpr size_t uniqueIDStart = 1'426'828'028;
        constexpr size_t numberOfRecords = 10'000'000;
        constexpr size_t endRecords = uniqueIDStart + numberOfRecords;

        std::random_device randomDevice;
        std::mt19937 randomEngine(randomDevice());
        std::uniform_int_distribution<int> uniformDistribution(1, 10'000'000);

        for (size_t k{ uniqueIDStart }; k < endRecords; ++k) {
            ofs << k << ' ' << uniformDistribution(randomEngine) << '\n';
        }
    }
}
// ----------------------------------------------------------------------------


// Here we will store our data
struct Data {
    unsigned int ID{};
    int value{ std::numeric_limits<int>::max() };

    // Sorting operators
    bool operator < (const int& i) const { return value < i; }                  // For lower_bound
    bool operator < (const Data& other) const { return value < other.value; }   // For sort

    // Simple extractor and inserter
    friend std::istream& operator >> (std::istream& is, Data& d) { return is >> d.ID >> d.value; }
    friend std::ostream& operator << (std::ostream& os, const Data& d) { return os << d.ID << ' ' << d.value; }
};


// Whatever number of X you need for the X-largest values
constexpr size_t Rank = 50;
// We will use a list to store the X-largest data
using DList = std::list<Data>;   
using ListIter = DList::iterator;

// For faster reading we will increase the input buffer size
constexpr size_t ifStreamBufferSize = 500'000u;
static char buffer[ifStreamBufferSize];

int main(int argc, char* argv[]) {

    // If you want to create test data, then uncomment the following line
    //createBigFile();

    //We will either read from std::cin or from a file
    std::shared_ptr<std::istream> input{};
    if (argc == 2) {
        // Try to open the source file, given by command line arguments
        input.reset(new std::ifstream(argv[1]));
        input->rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
    }
    else {
        // Use std::cin for input. Hand over a no-op custom deleter. We do not want to close std::cin
        input.reset(&std::cin, [](...) {});
    }

    // If the file could be opened (the stream is good) or std::cin is OK
    if (input && *input) {

        // Here we will store all data
        DList dList;

        // Read the first X values as is
        size_t numberOfElementsInArray{};
        Data data;
        
        while (*input >> data) {
            if (numberOfElementsInArray < Rank) {
                dList.push_front(std::move(data));
                ++numberOfElementsInArray;
            }
            if (numberOfElementsInArray >= Rank) break;
        }
        // And do a first (and the only) sort
        dList.sort();

        // For comparison
        int smallest{ dList.front().value };
        
        // Read all data from file
        while (*input >> data) {
           
            // If the latest read value is bigger than the smallest in the list, then we need to add a new value now
            if (data.value > smallest) {

                // Find the position where the new element has to be inserted.
                // If the new value is larger than all stored values, lower_bound
                // returns end(), and inserting there simply appends it at the back.
                ListIter dIter = std::lower_bound(dList.begin(), dList.end(), data.value);

                // Insert the new value where it should be. This keeps the list sorted.
                dList.insert(dIter, std::move(data));

                // We now have one value more than needed. Get rid of the smallest one
                // and remember the new smallest value.
                dList.pop_front();
                smallest = dList.front().value;
            }
        }
        std::copy(dList.rbegin(), dList.rend(), std::ostream_iterator<Data>(std::cout, "\n"));
    }
    else std::cerr << "*** Error with the input file or with std::cin\n\n";
}
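
The answer does not show a build command; assuming the source is saved as e.g. topx.cpp, something like g++ -std=c++17 -O2 topx.cpp -o topx should work, since the if statement with initializer in createBigFile requires at least C++17.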