Deduping 28 Million Strings Using JavaScript


I have a JSON file with 28 million strings, the size is roughly 15 GB. Some of the strings are duplicates, so I need to create a new JSON file with just the unique ones. I'm guessing 240 million of them are unique, but I need to find out the exact number. The strings are all less than 100 characters. Here is an example of the data:

[
    "4zWMS2IHAKcsrVtrUBFXIFjkwvbiCyiK",
    "btFqRsglI1Dh81jpgmnRhKPGIBbe2cU7",
    "8Us6mE6zWfyOpjhXsJssE65LrOFc7yr6",
    ...
]

My first approach was to create a JavaScript object and set every string as a key of that object. Then I would count the object's keys, and that would be my unique count. Unfortunately, I ran into a limit: JavaScript objects can only hold roughly 8 million keys.
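
Roughly, my first attempt looked like this (`strings` is just a stand-in name for the parsed array):

// First approach: use a plain object as a set and count its keys.
// `strings` is a placeholder for the array parsed from the JSON file.
let seen = {};
for(let string of strings) {
    seen[string] = true;
}
console.log('Unique strings:', Object.keys(seen).length);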

My next approach was to create a new array, iterate through my strings, and use the .indexOf method to check whether I had already added each string to the array. Unfortunately, this is far too slow.
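
And this is roughly the second attempt; every .indexOf call scans the array from the start, so the work grows quadratically with 28 million strings:

// Second approach: check membership with .indexOf before pushing.
// Each .indexOf call is a linear scan, so the total cost is O(n²).
let unique = [];
for(let string of strings) {
    if(unique.indexOf(string) === -1) {
        unique.push(string);
    }
}
console.log('Unique strings:', unique.length);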

Can anyone think of a way I can do this in JavaScript? I am also OK switching to a different language if this is not the right tool for the job.

CodePudding user response:

Computers are insane. Inspired by the comments, I sharded the keys across many objects so I would not hit the ~8M key limit in JavaScript. I then used GitHub Copilot to write this code in about 30 seconds: I wrote just the comments and let Copilot fill in the rest:

// NodeFileSystem is an alias for Node's built-in fs module
const NodeFileSystem = require('fs');

// Find the files we want to count
let files = NodeFileSystem.readdirSync('data/imported');
let keyArray = [];
let keyArrayLength = 0;

// Add all of the keys to an array
for(let file of files) {
    if(file.endsWith('.json')) {
        console.log('Parsing file', file);
        let data = JSON.parse(NodeFileSystem.readFileSync('data/imported/' + file));

        // Add the data item.key to the array
        for(let item of data) {
            keyArray.push(item.key);
            keyArrayLength++;
        }
    }
}
console.log('Total array length:', keyArrayLength);

// JavaScript will only allow us to have 8 million keys in an object, so we need to shard the array
// into several objects, using the first characters of each key
let characterCountToUseForSharding = 2;

// An object to store the sharded objects
let shardedObjects = {};

// Loop through the key array
let processedCount = 0;
for(let key of keyArray) {
    processedCount++;
    if(processedCount % 1000000 === 0) {
        let processCountWithCommas = processedCount.toLocaleString();
        console.log('Processed', processCountWithCommas, 'of', keyArrayLength.toLocaleString());
    }

    // Get the first characterCountToUseForSharding characters of the key
    let shardingKey = key.substring(0, characterCountToUseForSharding);

    // If the sharded object doesn't exist, create it
    if(!shardedObjects[shardingKey]) {
        shardedObjects[shardingKey] = {};
        // console.log('Created sharded object', shardingKey);
    }

    // Add the key to the sharded object
    shardedObjects[shardingKey][key] = true;
}

// Count the keys in each sharded object
let total = 0;
for(let shardingKey in shardedObjects) {
    let keyCount = Object.keys(shardedObjects[shardingKey]).length;
    console.log('Sharding key', shardingKey, 'has', keyCount, 'keys');
    total += keyCount;
}

// Log the total
console.log('Total keys:', keyArrayLength);
console.log('Total unique keys:', total);

// Percentage
let percentage = (total / keyArrayLength) * 100;
console.log('Unique percentage:', percentage);

// Duplicate keys
let duplicateKeys = keyArrayLength - total;
console.log('Duplicate keys:', duplicateKeys);

Here you can see the output and speed:

https://www.youtube.com/watch?v=RurzxMjbazg

I am running it on an M1 MacBook Pro and I'm just amazed at the speed. The fan did not even turn on.

CodePudding user response:

I don't have a ready-to-use solution, but I can drop a few pieces of advice here that could lead you to an efficient one:

  1. If you are using NodeJS, consider writing a C++ addon. With C++ you can write low-level code that, depending on your experience, can lead to faster and more efficient software. Writing the addon in C++ lets you pick the right data structure for this task, manage memory efficiently, and work around many of JavaScript's limitations.

  2. Work with file-sized units. I think the file system is the key to your problem. In other words, you can apply on-disk sorting to your file (e.g. with the Linux sort command), then split the 15GB file into, let's say, 200MB chunks. You can then load each chunk with NodeJS and remove duplicates, paying attention that duplicates of the last rows of the nth chunk can appear at the start of the (n+1)-th chunk (see the sketch after this list).

  3. (bonus) Exploit all of your CPU threads. Even if this alone won't solve the memory problem, combined with the multi-chunk solution it will lead to faster code.
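
A minimal sketch of point 2, assuming the strings have already been written one per line and sorted on disk (e.g. `sort strings.txt -o sorted.txt`; the file names below are just placeholders): after sorting, duplicates sit next to each other, so a single streaming pass can count the unique strings and write them out without holding everything in memory, which also takes care of the chunk-boundary issue.

// Stream the on-disk-sorted file and skip adjacent duplicates.
// 'sorted.txt' and 'unique.txt' are placeholder file names.
const NodeFileSystem = require('fs');
const NodeReadline = require('readline');

async function countAndWriteUnique() {
    const input = NodeFileSystem.createReadStream('sorted.txt');
    const output = NodeFileSystem.createWriteStream('unique.txt');
    const lines = NodeReadline.createInterface({ input, crlfDelay: Infinity });

    let previousLine = null;
    let uniqueCount = 0;

    for await (const line of lines) {
        // After sorting, duplicates are adjacent, so comparing each line
        // with the previous one is enough to spot them.
        if(line !== previousLine) {
            output.write(line + '\n');
            uniqueCount++;
            previousLine = line;
        }
    }

    output.end();
    console.log('Unique strings:', uniqueCount);
}

countAndWriteUnique();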
