Home > database >  Deduplication optimization. C
Deduplication optimization. C

Time:01-08

The problem is as follows. I want a function that, given a list and a max number of occurences "x", deletes all elements of the list that appear more than x times or x times.

I found a pretty straightforward solution, which is to check for each of the elements. This said, to repeat the find and delete functions many times seems computationally-wise not optimal to me.

I was wondering whether you could provide a better algorithm (i excluded allocating memory for a matrix from the min to the max... just too much for the task... say you have few very big numbers and your memory won't do it.)

My code follows.

typedef struct n_s
{
    int val;
    struct n_s *next;
}
n_t;

// deletes all elements equal to del in list with head h
n_t * delete(n_t *h, int del);
// returns the first occurrence of find in list with head h, otherwise gives NULL
n_t * find(n_t *h, int find);

n_t *
delFromList(n_t *h, int x)
{
    int val;
    n_t *el, *posInter;

    // empty list case
    if (h == NULL)
        return NULL;

    // first element
    val=h->val;
    if ( (posInter = find(h -> next,val)) 
        && (find(posInter -> next, val)))
        h = delete(h, val);

    // loop from second element
    el = h;
    while (el -> next)
    {
        val = el -> next -> val;
        // check whether you want to delete the next one, 
        // and then if you do so, check again on the "new" next one
        if ((posInter = find(el -> next -> next, val))                   
            && (find(posInter -> next, val)))
            el -> next = delete(el -> next, val);
        // in case you did not delete the nexy node, you can move on
        else 
            el = el -> next;

    }

    return h;
}

I know that the el->next->next may look confusing, but i find less intuitive to use variables such as "next", "past"... so, sorry for your headache.

Thank you for your time.

CodePudding user response:

One option for an algorithm with improved performance is:

  • Define a data structure D with two members, one for the value of a list element and one to count the number of times it appears.
  • Initialize an empty balanced tree ordered by value.
  • Iterate through the list. For each item in the list, look it up in the tree. If it is not present, insert a D structure into that tree with its value member copied from the list element and its count set to one. If it is present in the tree, increments its count. If its count equals or exceeds the threshold, remove it from the list.

Lookups and insertions in a balanced tree are O(log n). A linked list of n items uses n of them, and deletions from a linked list are O(1). So the total time is O(n log n).

CodePudding user response:

Use a counting map to count the number of times each element appears. The keys are the elements, and the values are the counts.

Then, go through your array a second time, deleting anything which meets your threshold.

O(n) time, O(n) extra space.

  • Related