There are hundreds of thousands of files that I need to scan every day to remove duplicate entries from. Each of these files in turn has several thousand records.
Sample input file
2019-10-04,3.9,3.29,5.85,6.15
2019-10-05,3.8,7.02,5.69,6.83
2019-10-05,3.8,8.02,8.69,1.83
2019-10-07,1.8,1.02,4.69,7.83
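Desired output (only the last line for each date is kept, as described below):
2019-10-04,3.9,3.29,5.85,6.15
2019-10-05,3.8,8.02,8.69,1.83
2019-10-07,1.8,1.02,4.69,7.83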
Here is the script I've written for it, which takes about an hour or more to complete.
Script
#!/bin/bash
LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"
remove_dup(){
    fname=${1}
    awk -F"," 'prev && ($1 != prev) {print seen[prev]} {seen[$1] = $0; prev = $1} END {print seen[$1]}' "${fname}" > "${CLEANEDUP_DIR}/${fname}"
}

cd ${LOOKUP_DIR}
for k in *.csv
do
    remove_dup "${k}" &
done
wait
Duplicates are identified by looking at the first field: if there are multiple entries with the same value in this field (the date, in this case), only the last line with that date needs to be retained and the rest removed.
Is there a way to optimise the logic I've written, please?
CodePudding user response:
Try:
tac thefile | sort -urst, -k1,1
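For example, applied per file with the directory layout from the question, a sketch (not tested against your data) could look like:
#!/bin/bash
LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"

cd "${LOOKUP_DIR}" || exit 1
for k in *.csv
do
    # tac reverses the file so the last line for each date comes first;
    # the stable unique sort on field 1 then keeps that line.
    # Note: -r means the output ends up in descending date order.
    tac "${k}" | sort -urst, -k1,1 > "${CLEANEDUP_DIR}/${k}"
done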
Optimise performance
Rewrite it in a single programming language. Do not use processes - use threads for each file. For scripting, use Python or Ruby. For compiled code, use C or C++. This took just under an hour to write, and most probably is immensely faster than fork()ing a new process for each file:
#include <map>
#include <future>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <filesystem>
#include <vector>

// Keep only the last line seen for each value of the first
// comma-separated field of the given file.
std::string algo(const std::filesystem::path& file) {
    std::map<std::string, std::string> lines;
    std::ifstream ffile(file);
    std::string line;
    std::string field;
    size_t pos;
    while (std::getline(ffile, line)) {
        // Find the end of the first field (everything before the first comma).
        pos = 0;
        for (auto&& c : line) {
            if (c == ',') {
                break;
            }
            ++pos;
        }
        field = line.substr(0, pos);
        // A later line with the same key overwrites the earlier one.
        lines.insert_or_assign(std::move(field), std::move(line));
    }
    std::ostringstream of;
    for (auto&& i : lines) {
        of << i.second << '\n';
    }
    return of.str();
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <file-or-directory>\n";
        return 1;
    }
    const std::filesystem::path p(argv[1]);
    std::vector<
        std::pair<
            std::string, std::future<std::string>
        >
    > results;
    if (std::filesystem::is_regular_file(p)) {
        std::cout << algo(p) << '\n';
    } else {
        // One asynchronous task per regular file in the directory.
        for (auto&& f : std::filesystem::directory_iterator(p)) {
            if (f.is_regular_file()) {
                results.emplace_back(f.path(), std::async(algo, f.path()));
            }
        }
    }
    for (auto&& r : results) {
        std::cout
            << "=== " << r.first << " ===\n\n"
            << r.second.get() << '\n';
    }
}
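If you go down this route, building and running it could look like the following sketch (dedup.cpp and combined_output.txt are placeholder names; GCC with C++17 support is assumed, and very old GCC releases may additionally need -lstdc++fs for std::filesystem). Note that the program writes everything to stdout with a === filename === header per file, so you would redirect or adapt it to write into your cleaned-up directory:
g++ -std=c++17 -O2 -pthread dedup.cpp -o dedup
./dedup /path/to/source_files > combined_output.txt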
CodePudding user response:
If I understand your question and you want to remove any duplicated records from within each file, then you can use a pair of arrays in awk. The first uses a counter as the index so record order is maintained, storing the 5 fields joined by SUBSEP as the stored value. The second array, indexed by the 5 fields joined by SUBSEP, holds the record as the stored value. This allows a simple check of whether the 5 fields have been seen before using awk's index in array test.
Rather than writing the script inline in remove_dup(), just write an executable awk script that is called from remove_dup(). The script could be:
#!/usr/bin/awk -f
BEGIN { FS="," }
{
    # skip the record if these 5 fields have already been seen
    if ($1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 in array)
        next
    order[++n] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
    array[$1,$2,$3,$4,$5] = $0
}
END {
    for (i=1; i<=n; i++)
        print array[order[i]]
}
(Above, a record is only stored if the joined fields do NOT already exist as an index in array, ensuring all duplicates are removed, keeping the order of the first occurrence intact and discarding all others.)
Then you can modify your script as:
#!/bin/bash
LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"
AWKSCRIPT="/path/to/executable/awkscript"
remove_dup(){
    fname=${1}
    $AWKSCRIPT "${fname}" > "${CLEANEDUP_DIR}/${fname}"
}

cd ${LOOKUP_DIR}
for k in *.csv
do
    remove_dup "${k}" &
done
wait
(Note the addition of the path to the executable awkscript stored in the variable AWKSCRIPT.)
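One thing this approach assumes is that the awk script has been made executable so its shebang line can be used, e.g.:
chmod +x /path/to/executable/awkscript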
That should do what you are after.