I have an input file of about 10^5 rows.
Each row is a sequence of 24 bits, i.e.:
1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0
I need to compute the Hamming distance for every pair of rows. Here's my first implementation, using SciPy's hamming function:
import csv
from scipy.spatial.distance import hamming

with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    nodes = {}
    b = 24  # number of bits per row
    for nodeNum, node in enumerate(reader):
        nodes[nodeNum] = [int(i) for i in node]

for u, uBits in nodes.items():
    for v, vBits in nodes.items():
        # scipy's hamming returns the fraction of positions that differ,
        # so multiply by the bit count to get the distance in bits
        distance = hamming(uBits, vBits) * b
        # Do stuff
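(Note that the nested loops above compare every row with itself and visit each unordered pair twice; if only distinct pairs are needed, itertools.combinations would halve the work. A minimal sketch of that variant:)

from itertools import combinations

for (u, uBits), (v, vBits) in combinations(nodes.items(), 2):
    # each unordered pair (u, v) with u != v is visited exactly once
    distance = hamming(uBits, vBits) * b
    # Do stuff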
Here's the second implementation I came up with:

nodes[nodeNum] = sum([int(bit) * 2**power for power, bit in enumerate(node)])
Here I only store each row's integer value, but then I have to count the set bits in the result of each XOR myself:
def hamming(a, b):
    N = a ^ b               # bits set in N mark the positions where a and b differ
    distance = 0
    while N:
        distance += N & 1   # add 1 if the lowest bit is set
        N >>= 1             # move on to the next bit
    return distance
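(For reference, the standard library can produce the same count directly; a minimal sketch, where hamming_int is just an illustrative name:)

def hamming_int(a, b):
    # bin() gives the binary string of the XOR; count('1') counts the set bits
    return bin(a ^ b).count('1')

# On Python 3.10+ there is also a dedicated method: (a ^ b).bit_count()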
How can I improve my code (both in terms of memory usage and runtime, ideally)?
Answer:
Why not just put the whole .csv into an array and then let SciPy do all the work of computing pairwise distances?
import csv

import numpy as np
import scipy.spatial.distance

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])

nodes = np.array(nodes)  # not strictly necessary; pdist will convert a list of lists too
dists = scipy.spatial.distance.pdist(nodes, 'hamming')
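pdist returns the distances in condensed form (one entry per pair) as the fraction of positions that differ. A sketch of how you might get distances in bits, or the full square matrix (squareform is also in scipy.spatial.distance; note that a full 10^5 x 10^5 matrix will not fit in memory, so that step is only practical for smaller inputs):

bit_dists = dists * nodes.shape[1]                     # fractions -> number of differing bits
square = scipy.spatial.distance.squareform(bit_dists)  # optional n x n symmetric matrix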
Answer:
This may be the fastest you can do (watch out: for your data size it would take over 2 hours):
import csv

import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])

nodes = np.array(nodes)  # needed so the arithmetic below works element-wise
dists = 2 * np.inner(nodes - 0.5, 0.5 - nodes) + nodes.shape[1] / 2
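Why this works: for 0/1 entries, 2 * (a - 0.5) . (0.5 - b) = sum(a) + sum(b) - 2 * (a . b) - n/2, where n is the number of bits per row, and sum(a) + sum(b) - 2 * (a . b) is exactly the number of positions where a and b differ; adding n/2 back therefore recovers the Hamming distance. A quick sanity check of the identity on small random data (a sketch):

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(5, 24))
by_inner = 2 * np.inner(x - 0.5, 0.5 - x) + x.shape[1] / 2
by_pdist = squareform(pdist(x, 'hamming')) * x.shape[1]
assert np.allclose(by_inner, by_pdist)  # both give the pairwise distances in bits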