Remove duplicate entries in file - Optimise performance


There are hundreds of thousands of files that I need to scan every day to remove duplicate entries from. Each of these files in turn has several thousand records.

Sample input file

2019-10-04,3.9,3.29,5.85,6.15
2019-10-05,3.8,7.02,5.69,6.83
2019-10-05,3.8,8.02,8.69,1.83
2019-10-07,1.8,1.02,4.69,7.83

Here is the script I've written for it, which takes about an hour or more to complete.

Script

#!/bin/bash

LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"


remove_dup(){
    fname=${1}
    awk -F"," 'prev && ($1 != prev) {print seen[prev]} {seen[$1] = $0; prev = $1} END {print seen[$1]}' "${fname}" > "${CLEANEDUP_DIR}/${fname}"
}

cd ${LOOKUP_DIR}
for k in *.csv
do 
    remove_dup "${k}" &
done

wait

Duplicates are checked by looking at the first field: if there are multiple entries with the same value in this field (the date, in this case), only the last line with that date needs to be retained and the rest removed.
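For the sample input above, that means only the second 2019-10-05 line is kept, so the expected output is:

2019-10-04,3.9,3.29,5.85,6.15
2019-10-05,3.8,8.02,8.69,1.83
2019-10-07,1.8,1.02,4.69,7.83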

Is there a way to optimise the logic I've written please?

CodePudding user response:

Try:

tac thefile | sort -urst, -k1,1
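tac reverses the file so the last line for each date comes first; sort -u -s -t, -k1,1 then keeps only the first line it sees for each date key, which is the last one in the original file (the r just makes the output dates descending and can be dropped if you want them ascending). As a rough sketch of how this could slot into your existing loop, assuming GNU tac and sort and the same placeholder paths as before:

remove_dup(){
    fname=${1}
    # keep the last line per date: reverse the file, then do a stable unique sort on field 1
    tac "${fname}" | sort -urst, -k1,1 > "${CLEANEDUP_DIR}/${fname}"
}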

CodePudding user response:

Rewrite it in a single programming language. Do not fork a separate process per file - use threads. For scripting, use Python or Ruby; for a compiled language, use C or C++. This took just under an hour to write, and is most probably immensely faster than fork()ing a new process for each file:

#include <map>
#include <future>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <algorithm>
#include <filesystem>
#include <ostream>
#include <vector>

std::string algo(const std::filesystem::path& file) {
    std::map<std::string, std::string> lines;
    std::ifstream ffile(file);
    std::string line;
    std::string field;
    size_t pos;
    while (std::getline(ffile, line)) {
        // locate the first comma: everything before it is the date key
        pos = 0;
        for (auto&& c : line) {
            if (c == ',') {
                break;
            }
            pos++;
        }
        field = line.substr(0, pos);
        // keep only the most recent record seen for each date
        lines.insert_or_assign(std::move(field), std::move(line));
    }
    std::ostringstream of;
    for (auto&& i : lines) {
        of << i.second << '\n';
    }
    return of.str();
}

int main(int argc, char *argv[]) {
    const std::filesystem::path p(argv[1]);
    std::vector<
        std::pair<
            std::string, std::future<std::string>
            >
        > results;
    if (std::filesystem::is_regular_file(p)) {
        std::cout << algo(p) << '\n';
    } else {
        for (auto&& f : std::filesystem::directory_iterator(p)) {
            if (f.is_regular_file()) {
                results.emplace_back(f.path(), std::async(algo, f.path()));
            }
        }
    }
    for (auto&& r : results) {
        std::cout
            << "=== " << r.first << " ===\n\n"
            << r.second.get() << '\n';
    }
}
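For reference, something like this should build and run it with a C++17 compiler (dedup.cpp and the output file name are just placeholders here; -pthread is usually needed for std::async with GCC on Linux, and older GCC versions may also want -lstdc++fs for <filesystem>):

g++ -std=c++17 -O2 -pthread dedup.cpp -o dedup
./dedup /path/to/source_files > cleaned_output.txt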

CodePudding user response:

If I understand your question and you want to remove any duplicated records from within each file, then use a pair of arrays in awk. The first uses a counter as the index, so record order is maintained, and stores the 5 fields joined by SUBSEP as its value. The second, indexed by the 5 fields joined by SUBSEP, holds the full record as its value. This allows a simple check of whether the 5 fields have been seen before, using awk's in-array membership test.

Rather than writing the script in remove_dup(), just write an executable awk script that is called from remove_dup(). The script could be:

#!/usr/bin/awk -f

BEGIN { FS="," }

{ if ($1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 in array)
    next
  order[++n] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
  array[$1,$2,$3,$4,$5] = $0
}

END {
  for (i=1; i<=n; i++)
    print array[order[i]]
}

(above, a record is only stored if the joined fields do NOT already exist as an index in array, ensuring all duplicates are removed while keeping the first occurrence of each record intact and discarding all others)

Then you can modify your script as:

#!/bin/bash

LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"
AWKSCRIPT="/path/to/executable/awkscript"

remove_dup(){
    fname=${1}
    $AWKSCRIPT "${fname}" > "${CLEANEDUP_DIR}/${fname}"
}

cd ${LOOKUP_DIR}
for k in *.csv
do 
    remove_dup "${k}" &
done

wait

(note the addition of the path to the executable awkscript stored in the variable AWKSCRIPT)
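Since remove_dup() runs the awk script directly, remember to make it executable first:

chmod +x /path/to/executable/awkscript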

That should do what you are after.
