Most memory efficient way to remove duplicate lines in a text file using C++


I understand how to do this using std::string and std::unordered_set; however, each line and each element of the set carries significant memory overhead, so an unordered_set holding even half the lines of the file can end up 5-10 times larger than the file itself.

Is it possible (and if so, how) to reduce memory consumption so that, for example, duplicates can be removed from a 10-gigabyte file using no more than 20 gigabytes of RAM? Ideally this should still run in O(n) time.
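For reference, here is a minimal sketch of the baseline approach described above (file names are placeholders). It keeps every distinct line in memory in full, which is what drives the overhead being asked about:

#include <fstream>
#include <string>
#include <unordered_set>

int main()
{
    std::ifstream input("input.txt");      // placeholder file names
    std::ofstream output("output.txt");
    std::unordered_set<std::string> seen;  // stores every distinct line in full
    std::string line;
    while (std::getline(input, line))
    {
        // insert() returns {iterator, bool}; the bool is true only for new lines
        if (seen.insert(line).second)
            output << line << '\n';
    }
}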

CodePudding user response:

This code reads the input file line by line, storing only the hash of each line in memory. If a line has not been seen before, it is written to the output file; if it has been seen before, nothing is done.

It uses sparsepp to reduce the memory footprint.

Input data:

  • 12 GB file size
  • ~197,000,000 distinct lines
  • line length < 120 characters

Build:

  • C++20
  • release build
  • do not run within Visual Studio (no debugger attached)

Processing:

  • AMD Ryzen 2700X
  • 32 GB of RAM
  • Seagate SSD
  • 190 seconds
  • 954 MB virtual memory peak

Is that good enough? I can't tell: your performance requirements are quite vague, and you didn't provide any baseline figures to compare against. The results will also depend on your machine, your data, your RAM size, your disk speed, and so on.

#include <chrono>
#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include "spp.h"

int main()
{
    const auto begin = std::chrono::steady_clock::now();

    std::ifstream input("10gb.txt");
    std::ofstream output("10gb_nodupes.txt");

    // Only the 64-bit hash of each line is stored, never the line itself.
    // The mapped value is unused; spp::sparse_hash_set<std::size_t> would work too.
    spp::sparse_hash_map<std::size_t, std::size_t> hashes;

    std::string inputline;
    while (std::getline(input, inputline))
    {
        const std::size_t hash = std::hash<std::string>{}(inputline);
        if (hashes.count(hash) == 0)  // count() works with sparsepp's C++11-style API
        {
            output << inputline << '\n';
            hashes[hash] = 0;
        }
    }
    // Caveat: two distinct lines that hash to the same value are treated as
    // duplicates, so a distinct line can be silently dropped on a collision.

    const auto end = std::chrono::steady_clock::now();
    std::cout << "Time difference = "
              << std::chrono::duration_cast<std::chrono::seconds>(end - begin).count()
              << "[s]" << std::endl;
    std::cout << "Done";
}
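
One caveat with storing only hashes, as noted in the code: two different lines can collide, and the later one would be silently dropped. A possible refinement, sketched below as my own addition rather than part of the answer above, is to use the otherwise-unused map value to remember the file offset of the first line that produced each hash, and re-read that line through a second stream to confirm a real duplicate. File names are the same placeholders as above:

#include <fstream>
#include <functional>
#include <string>
#include "spp.h"

int main()
{
    std::ifstream input("10gb.txt");
    std::ifstream verify("10gb.txt");   // second handle for re-reading candidates
    std::ofstream output("10gb_nodupes.txt");

    // hash -> offset of the first input line that produced it
    spp::sparse_hash_map<std::size_t, std::streamoff> first_seen;

    std::string line, earlier;
    std::streamoff line_start = input.tellg();
    while (std::getline(input, line))
    {
        const std::size_t hash = std::hash<std::string>{}(line);
        auto it = first_seen.find(hash);
        bool duplicate = false;
        if (it != first_seen.end())
        {
            // Same hash seen before: re-read that line to rule out a collision.
            verify.clear();
            verify.seekg(it->second);
            std::getline(verify, earlier);
            duplicate = (earlier == line);
        }
        if (!duplicate)
        {
            output << line << '\n';
            if (it == first_seen.end())
                first_seen[hash] = line_start;
            // Limitation: on a genuine collision only the first offset is kept,
            // so later duplicates of the colliding line are not detected.
        }
        line_start = input.tellg();
    }
}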

CodePudding user response:

Sure, you can do this in O(n^2) time using only the amount of memory required to hold two lines, a boolean flag, and two file offsets.

The basic algorithm would be:

  • Open input file
  • Open output file
  • Set flag to false
  • Set pre-read offset to 0
  • While more input:
    • Read a first input line
    • Save current input file offset as post-read offset
    • Seek input file to offset 0
    • While current input file offset is less than pre-read offset:
      • Read a second input line
      • If first input line equals second input line:
        • Set flag to true
        • Break
    • If flag is false:
      • Write first input line to output file
    • Set flag to false
    • Seek input file to post-read offset
    • Set pre-read offset to post-read offset

This is, of course, extremely time-inefficient, but it's about as memory-efficient as you can get.

Possible C++ implementation:

#include <fstream>
#include <string>

int main()
{
    const char* inputFilePath = "input.txt";    // placeholder paths
    const char* outputFilePath = "output.txt";

    std::ifstream input(inputFilePath);
    std::ofstream output(outputFilePath);

    std::streamoff offset_before = 0;  // offset of the start of the current line
    std::streamoff offset_after = 0;   // offset just past the current line
    bool found_dupe = false;

    std::string line1;
    while (std::getline(input, line1)) {
        offset_after = input.tellg();

        // Rescan everything before the current line, looking for a match.
        input.seekg(0);
        std::string line2;
        while (input.tellg() < offset_before && std::getline(input, line2)) {
            if (line1 == line2) {
                found_dupe = true;
                break;
            }
        }

        // Only a line's first occurrence gets written out.
        if (!found_dupe) {
            output << line1 << '\n';
        }

        found_dupe = false;
        input.seekg(offset_after);     // resume where the outer read left off
        offset_before = offset_after;
    }
}


CodePudding user response:

I think you could implement this without allocating any additional RAM, if you're allowed to modify the input file:

  1. Do an in-place alphabetical sort of the lines in the input file. There are algorithms that can do this efficiently, perhaps using mmap() to access the file's contents as if they were in RAM; the sort is much easier to implement if all of the file's lines are the same length.
  2. Once the contents of the file have been sorted, do a single linear pass through the sorted data and output only the lines that differ from their predecessor, as sketched below. (Note that sorting means the output will not preserve the original line order.)
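
A minimal sketch of that second pass, assuming step 1 has already produced a sorted file (file names here are placeholders):

#include <fstream>
#include <string>

int main()
{
    std::ifstream input("sorted.txt");   // output of the in-place sort (step 1)
    std::ofstream output("nodupes.txt");

    std::string prev, line;
    bool first = true;
    while (std::getline(input, line)) {
        // In sorted data all duplicates are adjacent, so comparing each line
        // against its predecessor is enough to filter them out.
        if (first || line != prev) {
            output << line << '\n';
            prev = line;
            first = false;
        }
    }
}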