C what is the best sorting container and approach for large datasets (millions of lines)-CodePudding

I'm tackling a exercise which is supposed to exactly benchmark the time complexity of such code.

The data I'm handling is made up of pairs of strings like this hbFvMF,PZLmRb, each string is present two times in the dataset, once on position 1 and once on position 2 . so the first string would point to zvEcqe,hbFvMF for example and the list goes on....

example dataset of 50k pairs

I've been able to produce code which doesn't have much problem sorting these datasets up to 50k pairs, where it takes about 4-5 minutes. 10k gets sorted in a matter of seconds.

The problem is that my code is supposed to handle datasets of up to 5 million pairs. So I'm trying to see what more I can do. I will post my two best attempts, initial one with vectors, which I thought I could upgrade by replacing vector with unsorted_map because of the better time complexity when searching, but to my surprise, there was almost no difference between the two containers when I tested it. I'm not sure if my approach to the problem or the containers I'm choosing are causing the steep sorting times...

Attempt with vectors:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <map>
#include <unordered_map>
#include <stdio.h>
#include <vector>
#include <iterator>
#include <utility>
#include <functional>
#include <algorithm>

using namespace std;


template<typename T>
void search_bricks_backwards(string resume, vector<T>& vec, vector<string>& vec2) {
    int index = 0;
    int temp_index = 0;
    while (true) {
        if (index == vec.size()) {
            vec2.insert(vec2.begin(), vec[temp_index].first); 
            cout << "end of backward search, exitting..." << endl;
            break;


        }
        
        if (vec[index].second == resume) {
            vec2.insert(vec2.begin(), resume);
            

            resume = vec[index].first;
            //vec.erase(vec.begin()   index);
            temp_index = index;

            index = 0;
        }
        
        index  ;
    }

}


template<typename T>
void search_bricks(string start, vector<T>& vec, vector<string>& vec2) {
    int index = 0;
    int temp_index = 0;
    while (true) {
        //cout << "iteration " << index << endl;
        if (index == vec.size()) {
            vec2.push_back(vec[temp_index].second);
            
            cout << "all forward bricks sorted" << endl;
            break;


        }
        if (vec[index].first == start) {
            vec2.push_back(vec[index].first);
            
            
            start = vec[index].second;
            //vec.erase(vec.begin()   index);
            temp_index = index;
            index = 0;
            
        }
        
        index  ;
    }

    search_bricks_backwards(vec[0].first, vec, vec2);

}

template<typename T>
void search_bricks_recursion(string start, vector<T>& vec, vector<string>& vec2) {
    int index = 0;
    for (const auto& pair : vec) {
        //cout << "iteration " << index << endl;
        if (pair.first == start) {
            vec2.push_back(start);
            cout << "found " << start << " and " << pair.first << endl;
            search_bricks(pair.second, vec, vec2);
        }
        if (index   1 == vec.size()) {
            search_bricks_backwards(start, vec, vec2);
            

        }
        index  ;
    }
    
}

template<typename T>
void printVectorElements(vector<T>& vec)
{
    for (auto i = 0; i < vec.size();   i) {
        cout << "(" << vec.at(i).first << ","
            << vec.at(i).second << ")" << " ";
    }
    cout << endl;
}

vector<string> split(string s, string delimiter) {
    size_t pos_start = 0, pos_end, delim_len = delimiter.length();
    string token;
    vector<string> res;

    while ((pos_end = s.find(delimiter, pos_start)) != string::npos) {
        token = s.substr(pos_start, pos_end - pos_start);
        pos_start = pos_end   delim_len;
        res.push_back(token);
    }

    res.push_back(s.substr(pos_start));
    return res;
}

unordered_map<string, string> brick_to_map(string const& s)
{
    unordered_map<string, string> m;

    string key, val;
    istringstream iss(s);

    while (getline(getline(iss, key, ',') >> ws, val))
        m[key] = val;

    return m;
}

int main()
{
    vector<pair<string, string>> bricks;
    
    vector<string> sorted_bricks;

    ifstream inFile;
    inFile.open("input-pairs-50K.txt"); //open the input file

    stringstream strStream;
    strStream << inFile.rdbuf(); //read the file
    string str = strStream.str(); //str holds the content of the file

    //cout << str << endl;
    
    istringstream iss(str);
    
    for (string line; getline(iss, line); )
    {
     
        string delimiter = ",";
        string s = line;
        vector<string> v = split(s, delimiter);
        string s1 = v.at(0);
        string s2 = v.at(1);
        

        bricks.push_back(make_pair(s1, s2));
    }

   
    search_bricks(bricks[0].second, bricks, sorted_bricks);
    
    
    //display the results
    for (auto i = sorted_bricks.begin(); i != sorted_bricks.end();   i)
        cout << *i << " ";


    
 
}

Attempt with unsorted map:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <map>
#include <unordered_map>
#include <stdio.h>
#include <vector>
#include <iterator>
#include <utility>
#include <functional>
#include <algorithm>

using namespace std;


void search_bricks_backwards(string resume, unordered_map<string, string> brick_map, vector<string>& vec2) {
    
    typedef unordered_map<string, string>::value_type map_value_type;
    while (true) {

        unordered_map<string, string>::const_iterator got = find_if(brick_map.begin(), brick_map.end(), [&resume](const map_value_type& vt)
            { return vt.second == resume; }
        );
        if (got == brick_map.end()) {
            vec2.insert(vec2.begin(), resume); 
            cout << "end of backward search, exitting..." << endl;
            break;


        }
        //cout << "iteration " << index << endl;
        else if (got->second == resume) {
            vec2.insert(vec2.begin(), resume);

            
            resume = got->first;
        
        }

       
    }

}


void search_bricks(string start, unordered_map<string, string> brick_map, vector<string>& vec2) {
    
    typedef unordered_map<string, string>::value_type map_value_type;
    while (true) {
        

        unordered_map<string, string>::const_iterator got = find_if(brick_map.begin(), brick_map.end(), [&start](const map_value_type& vt)
            { return vt.first == start; }
        );
        if (got == brick_map.end()) {
            vec2.push_back(start);

            cout << "all forward bricks sorted" << endl;
            
            break;
        }
        else if (got->first == start) {
            vec2.push_back(start);

            //cout << "found " << start << " and " << vec[index].first << endl;
            start = got->second;
            
        }
    }
    auto it = brick_map.begin();
    search_bricks_backwards(it->first, brick_map, vec2);
    


    

    

}


template<typename T>
void printVectorElements(vector<T>& vec)
{
    for (auto i = 0; i < vec.size();   i) {
        cout << "(" << vec.at(i).first << ","
            << vec.at(i).second << ")" << " ";
    }
    cout << endl;
}

vector<string> split(string s, string delimiter) {
    size_t pos_start = 0, pos_end, delim_len = delimiter.length();
    string token;
    vector<string> res;

    while ((pos_end = s.find(delimiter, pos_start)) != string::npos) {
        token = s.substr(pos_start, pos_end - pos_start);
        pos_start = pos_end   delim_len;
        res.push_back(token);
    }

    res.push_back(s.substr(pos_start));
    return res;
}


int main()
{
    unordered_map<string, string> bricks;
    
    vector<string> sorted_bricks;

    ifstream inFile;
    inFile.open("input-pairs-50K.txt"); //open the input file

    for (string line; getline(inFile, line); )
    {

        string delimiter = ",";
        string s = line;
        vector<string> v = split(s, delimiter);
        string s1 = v.at(0);
        string s2 = v.at(1);


        bricks.insert(make_pair(s1, s2));
    }


    /*for (auto& x : bricks)
        std::cout << x.first << "," << x.second << " ";*/


    auto it = bricks.begin();
    search_bricks(it->second, bricks, sorted_bricks);


    // display results
    for (auto i = sorted_bricks.begin(); i != sorted_bricks.end();   i)
        cout << *i << " ";




}

I'm looking to improve the time complexity of my code to be able to process the data more eficiently, if anyone can suggest what to improve in my code or container wise I'd be very thankful.

CodePudding user response：

The biggest killer in your algorithm are those sequence scans. Maps offer key lookup operations. Their find member is what makes them tick (and shine). If you want to make this scream, you need to use keying to do so. A close runner up to that is the exorbitant expense of pre-pending to a std::vector, which can get ridiculously expensive fast.

Before I go on, one thing. I scanned the source test you provided. There are actually 49999 rows in the data, not 50000. After some trial, error, head scratching, etc., I discovered this:

EZkYGM appears in the file only once, as a right side
bWyUVV appears only once, as a left side.

I suspect these are "terminals" of the sorted list? The results seemed to suggest that is indeed the case. But hey, I'm rather glad I finally understood what you meant by "sorting" in this dataset context (and it only took you beating it into my head for two days for it to finally click).

Anyway...

Performance Improvements All Around

Stripping a lot of code out of this while still keeping the basic premise, consider the following:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <unordered_map>
#include <deque>
#include <algorithm>
#include <chrono>

void search_bricks_backwards
(
    std::string resume, 
    std::unordered_map<std::string, std::string>& rbricks, 
    std::deque<std::string>& dst
) 
{
    while (true) 
    {
        auto it = rbricks.find(resume);
        dst.emplace_front(std::move(resume));

        if (it == rbricks.end()) 
            break;

        resume = it->second;
        rbricks.erase(it);
    }
}

void search_bricks
(
    std::unordered_map<std::string, std::string>& bricks, 
    std::unordered_map<std::string, std::string>& rbricks, 
    std::deque<std::string>& dst
) 
{
    // remember these
    std::string start = bricks.begin()->second;
    std::string resume = bricks.begin()->first;

    while (true) 
    {
        auto it = bricks.find(start);
        dst.emplace_back(std::move(start));

        if (it == bricks.end()) 
            break;

        start = it->second;
        bricks.erase(it);
    }

    // same search, but different keyed map
    search_bricks_backwards(resume, rbricks, dst);
}


int main()
{
    using namespace std::chrono;

    std::unordered_map<std::string, std::string> bricks, rbricks;
    std::deque<std::string> sorted_bricks;

    std::ifstream inFile("input-pairs-50K.txt");
    if (inFile.is_open())
    {
        for (std::string line; std::getline(inFile, line);)
        {
            // make damn sure we get two keys
            std::istringstream iss(line);
            std::string s1, s2;
            if (std::getline(iss, s1, ',') && 
                std::getline(iss, s2) &&
                !s1.empty() &&
                !s2.empty())
            {
                bricks.emplace(std::make_pair(s1, s2));
                rbricks.emplace(std::make_pair(s2, s1));
            }
        }

        auto tp0 = high_resolution_clock::now();
        search_bricks(bricks, rbricks, sorted_bricks);
        auto tp1 = high_resolution_clock::now();

        std::cout << duration_cast<milliseconds>(tp1-tp0).count() << "ms\n";

        // display results
        int n = 0;
        for (auto i = sorted_bricks.begin(); i != sorted_bricks.end();   i)
            std::cout <<   n << ". " << *i << '\n';
    }
}

The most notable differences:

Using std::deque<std::string> as the sorted results. You were paying a dear price for those vector placements. All of your vector operations were either pushing on the back (ok for reserved vectors) or pushing on the front (dreadful performance in vectors). The std::deque is specialized for very fast front and back insertion and pruning. We're not doing the latter, but are heavily doing the former.
Two maps, keying s1==>s2 and s2==>s1 respectively. there are third party containers that do this for you (a commonly referenced one is boost::bimap). For this, however, considering the key sizes, just storing two maps is easy.
Move semantics are used (though not much) where appropriate
During searches, each discovered key is deleted from the map just-searched. This ensures we stop when we're supposed to, and recovers memory footprint fairly quickly.
Reference parameters where applicable. These containers are effectively destroyed while building the result set, but that's ok (and in fact warranted to properly detect a terminal on one end or the other).

The double-keying makes a world of difference. A release build using your test data set on a puny little four-year-old macbook pro laptop produces the following (which you can verify with your know-answer tests).

All code is built with -O2 optimization running Apple clang version 13.0.0 (clang-1300.0.29.30)

std::unordered_map

34ms
1. bWyUVV
2. mPBGRC
3. WfkvWy
4. vjWNHY
5. HudtyD
6. DhxjdV
7. kdWhGX
8. tIsDXh
9. eMVeMX
10. fVQoeG

... 49980 lines omitted ...

49990. YfBDnP
49991. sHKVrT
49992. ZzhoZV
49993. Dyunmj
49994. KCQbpj
49995. rbMSgD
49996. WKOksU
49997. qqMTnq
49998. llrqUI
49999. XYpBnk
50000. EZkYGM

That timing figure is fairly consistent, though there were outliers as high as 42ms and as low as 28ms in my testing. The same harness using a regular std::map resulted in:

std::map

92ms
1. bWyUVV
2. mPBGRC
3. WfkvWy
4. vjWNHY
5. HudtyD
6. DhxjdV
7. kdWhGX
8. tIsDXh
9. eMVeMX
10. fVQoeG

... 49980 lines omitted ...

49990. YfBDnP
49991. sHKVrT
49992. ZzhoZV
49993. Dyunmj
49994. KCQbpj
49995. rbMSgD
49996. WKOksU
49997. qqMTnq
49998. llrqUI
49999. XYpBnk
50000. EZkYGM

So you're definitely using the right container with an unordered keying, you just weren't actually using the keying, making the container basically no better than a sequential scan. That jives with your assessment it really wasn't any better than a vector sequential scan solution; you were basically doing that regardless.

I suspect performance shown above should be able to run sets of your expected 5-million pairs, so long as you can keep it all in memory (twice, sorry about that, but the results show pretty candidly that it is worth that price).

CodePudding user response：

You can use a trie data structure, here's a paper that explains an algorithm to do that: https://people.eng.unimelb.edu.au/jzobel/fulltext/acsc03sz.pdf

But you have to implement the trie from scratch because as far as I know there is no default trie implementation in c .