Home > Blockchain >  Computing hamming distances on large data set
Computing hamming distances on large data set

Time:05-01

I have an input file of about 10^5 rows.
Each row is a sequence of 24 bits, i.e.:

1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0

I need to compute the Hamming Distance for every pair of rows. Here's my first implementation using SciPy hamming function:

from scipy.spatial.distance import hamming

with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    nodes = {}
    b = 24  # Number of bits
    for nodeNum, node in enumerate(reader):
        node[nodeNum] = [int(i) for i in node]

for u, uBits in nodes.items():
    for v, vBits in nodes.items():
        distance = hamming(uBits, vBits) * b
            # Do stuff

Second implementation I came up with:

node[nodeNum] = sum([int(bit)*2**power for power, bit in enumerate(node)])

Here I only store the decimal value but I then have to manually count set bits resulting from each XOR operation:

def hamming(a, b):
    N = a ^ b

    distance = 0
    ptr = 1

    while N:
        distance  = ((N   1) //
                           2 * ptr)
        N -= (N   1) // 2
        ptr  = 1

    return distance

How can I improve my code (both in terms of memory usage and runtime, ideally)?

CodePudding user response:

Why not just put the whole .csv into an array and then let scipy do all the work of computing pairwise distances?

import numpy as np
import pandas as pd
import scipy.spatial.distance

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])

nodes = np.array(nodes)  # not strictly necessary
dists = scipy.spatial.distance.pdist(nodes, 'hamming')

CodePudding user response:

This may be the fastest you can do (watchout, for your data size it'd take above 2 hrs):

import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])

dists = 2 * np.inner(nodes-0.5, 0.5-nodes)   nodes.shape[1] / 2
  • Related