Find the unique sorted values of a vector and also indexes of the sorted vector to regenerate the or-CodePudding

Suppose we have a vector like

a = {2,2,2,1,7,7,7,5,5}

I wish to get a sorted vector of unique elements which in this case gives

b = {1,2,5,7}

I also wish to get a vector of indexes of this sorted vector that reconstructs the initial vector:

c = {1,1,1,0,3,3,3,2,2}

Here b[x] for each element x in c returns the original vector a.

{ b[1],b[1],b[1],b[0],b[3],b[3],b[3],b[2],b[2] } = { 2,2,2,1,7,7,7,5,5 }

I wish to write an efficient algorithm in C to give me vectors b and c, given a vector a.

I have implemented a brute force solution below:

    std::vector<int> vector_a{ 2,2,2,1,7,7,7,5,5 };

    std::vector<int> vector_b(vector_a); 
    std::sort(vector_b.begin(), vector_b.end());
    std::vector<int>::iterator ip;

    ip = std::unique(vector_b.begin(), vector_b.end());
    vector_b.resize(std::distance(vector_b.begin(), ip));

    std::vector<int> vector_c;

    for (int i = 0; i < vector_a.size();   i) {
        for (int j = 0; j < vector_b.size();   j) {
            if (vector_a[i] == vector_b[j]) {
                vector_c.push_back(j);
            }
        }
    }

Is there a way I can implement a more efficient solution with complexity lesser than O(m*n)?

CodePudding user response：

I would use a map of vectors or lists. The index would be the unique value, and the value would be the list of indices in the original vector.

It can be built in one single pass (complexity O(n)) and can in turn build the b and c vectors in one single pass (again complexity O(n))

Possible code:

#include <iostream>
#include <map>
#include <vector>

int main()
{
    std::vector<int> a { 2,2,2,1,7,7,7,5,5 };
    std::map<int, std::vector<int> > m;
    for (int i = 0; i < a.size(); i  ) {
        m[a[i]].push_back(i);
    }
    std::vector<int> b(m.size()), c(a.size());
    int i = 0;
    for (auto& it : m) {
        for (int j : it.second) {
            c[j] = i;
        }
        b[i  ] = it.first;
    }
    std::cout << "b: " << b[0];
    for (int i = 1; i < b.size(); i  ) std::cout << ", " << b[i];
    std::cout << "\nc: " << c[0];
    for (int i = 1; i < c.size(); i  ) std::cout << ", " << c[i];
    std::cout << std::endl;
    return 0;
}

It gives as expected:

b: 1, 2, 5, 7
c: 1, 1, 1, 0, 3, 3, 3, 2, 2

After a second thought, the complexity should be slightly more than O(n) because inserting or searching in a std::map is not constant but is O(log(n)). But is should still be at most O(n*log(n))...

CodePudding user response：

What you want is what numpy.unique(a, return_inverse=True) returns in Python. I've found similar questions here: C equivalent of numpy.unique on std::vector with return_index and return_inverse, Find the indices and inverse mapping of a unique vector.

Here's a strategy on how to accomplish this:

Generate [0, a.size()) in a vector, i
Sort i based on the value in vector a at the given index
Find the unique values by iterating over i and finding the indices whose item in a are different from the index before (including the first). This is similar to the std::unique algorithm.
Find the "inverse indices" by iterating over i and assigning each index of the inverses by a counter that increments when there is a new unique value. This incrementing counter corresponds exactly to indices in the unique vector.

(The last two steps can be done together).

Here is an example:

template<typename T>
struct unique_indices {
    std::vector<T> uniques;
    std::vector<std::size_t> inverse_indices;
};

template<typename T, typename Cmp = std::less<T>>
unique_indices<T> sort_unique(const std::vector<T>& v, Cmp&& cmp = {}) {
    unique_indices<T> result;

    // generate indices
    std::vector<std::size_t> sorted_indices(v.size());
    std::iota(sorted_indices.begin(), sorted_indices.end(), std::size_t{0});

    // sort based on indices
    // (Could want std::stable_sort here if you care about which exact
    //  item is copied into `unique`, but doesn't matter for `int` with `std::less`)
    std::sort(sorted_indices.begin(), sorted_indices.end(), [&](std::size_t i, std::size_t j) {
        return cmp(v[i], v[j]);
    });

    // get unique values   set inverse indices
    // (similar to std::unique algorithm)
    auto first = sorted_indices.begin();
    auto last = sorted_indices.end();
    if (first == last) return result;  // v was empty

    result.inverse_indices.resize(v.size());

    std::size_t unique_index = 0;
    result.uniques.push_back(v[*first]);
    result.inverse_indices[*first] = unique_index;

    while (  first != last) {
        if (cmp(result.uniques.back(), v[*first])) {
            // new value
              unique_index;
            result.uniques.push_back(v[*first]);
        }
        result.inverse_indices[*first] = unique_index;
    }

    return result;
}

std::vector<int> vector_a = {2,2,2,1,7,7,7,5,5};
auto [vector_b, vector_c] = sort_unique(vector_a);
// vector_b == {1,2,5,7}
// vector_c == {1,1,1,0,3,3,3,2,2}

O(N lg N) runtime and O(N) extra space needed.
Your original solution was O(N lg N N M) time also with O(N) extra space.

(Where N = a.size() and M is the number of unique values in a)

CodePudding user response：

You indicated in a comment that in your use case, all values are in between [0, 255].

This allows to implement a variant of counting sort. Three steps:

Mark which values are used (array values): values[a[i] = 1.
looping over array values, add relevant ones in the vector unique, and at the same time associate each value to the corresponding index of the vector
Looping over the input array a[], directly find the corresponding index values[a[i]], and add this index in the vector indexes.

Total complexity: O(n)

#include <iostream>
#include <vector>
#include <array>
#include <utility>

#define NMAX 256

void print (const std::vector<int> &v, const std::string& mes = "") {
        std::cout << mes;
        for (int i: v) {
                std::cout << i << " ";
        }
        std::cout << "\n";
}

std::pair<std::vector<int>, std::vector<int>> sort_unique (const std::vector<int> &a) {
    int n = a.size();
    std::vector<int> unique, indexes (n);
    std::array<int, NMAX> values;
    values.fill (0);
    unique.reserve (NMAX);
    for (auto& x: a) {
        values[x] = 1;
    }
    int cpt = 0;
    for (int i = 0; i < NMAX;   i) {
        if (values[i] == 0) continue;
        values[i] = cpt  ;
        unique.push_back(i);
    }
    for (int i = 0; i < n;   i) {
        indexes[i] = values[a[i]];
    }
    return {unique, indexes};
}

int main() {
    std::vector<int> a = {2,2,2,1,7,7,7,5,5};
    auto [unique, indexes] = sort_unique (a);
    print (a, "input: ");
    print (unique, "unique values: ");
    print (indexes, "indexes: ");
    return 0;
}