Could someone please help me out with the following?
I have one dataframe with two columns: products and webshops (n x 2) with n products. Now I would like to obtain a binary (n x n) matrix with all products listed as the indices and all products listed as the column names. Then each cell should contain a 1 or 0 denoting whether the product in the index and column name came from the same webshop.
The following code is returning what I would like to achieve.
dist = np.empty((len(df_title), len(df_title)), int)
for i in range(0, len(df_title)):
    for j in range(0, len(df_title)):
        boolean = df_title.values[i][1] == df_title.values[j][1]
        dist[i][j] = boolean
df = pd.DataFrame(dist)
However, this code takes quite a significant time already for n = 1624. Therefore I was wondering if someone would have an idea for a faster algorithm.
Thanks!
CodePudding user response:
It seems like you're only interested in the element at position 1 of every row (the webshop) anyway, so storing that column in a temporary variable for easier lookup could help:
lookup = df_title.values[:, 1]
Also, since you want to interpret the resulting matrix as a bool matrix, you should probably specify dtype=bool (1 byte per field) instead of dtype=int (typically 8 bytes per field), which also cuts memory consumption by a factor of 8.
dist = np.empty((len(df_title), len(df_title)), dtype=bool)
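To make the size difference concrete, here is a small sketch comparing the two dtypes for the n = 1624 case from the question. It spells out np.int64 explicitly, which is an assumption: the default int is 8 bytes on most 64-bit platforms, but it can be smaller on others.

```python
import numpy as np

n = 1624  # number of products mentioned in the question

# np.int64 is an assumption about the platform's default integer size
as_int = np.empty((n, n), dtype=np.int64)
as_bool = np.empty((n, n), dtype=bool)

print(as_int.nbytes)                    # 8 bytes per cell
print(as_bool.nbytes)                   # 1 byte per cell
print(as_int.nbytes // as_bool.nbytes)  # ratio between the two
```

At this size neither array is large, so the dtype mostly matters for memory once n grows; the speedup has to come from avoiding the Python-level loops.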
Your matrix will be symmetric about the main diagonal anyway, so you only need to compute "half" of the matrix. Also, if i == j we know the corresponding field in the matrix should be True.
lookup = df_title.values[:, 1]
n = len(lookup)
dist = np.empty((n, n), dtype=bool)
for i in range(n):
    # diagonal: a product always shares a webshop with itself
    dist[i, i] = True
    # start at i + 1 so each pair is compared only once
    for j in range(i + 1, n):
        # symmetric about the diagonal, so fill both halves at once
        dist[i, j] = dist[j, i] = lookup[i] == lookup[j]
Also, using NumPy broadcasting you could actually transform all of that into a single line of code that is orders of magnitude faster than the double for-loop solution:
lookup = df_title.values[:, 1]
dist = lookup[None, :] == lookup[:, None]
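As a minimal end-to-end sketch, the broadcasting result can be wrapped back into a labelled DataFrame of 0/1 values, which is what the question asks for. The sample data and the column names "product" and "webshop" are made up for illustration, since the original df_title is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are assumptions
df_title = pd.DataFrame({
    "product": ["tv_a", "tv_b", "tv_c", "tv_d"],
    "webshop": ["shop_x", "shop_y", "shop_x", "shop_z"],
})

products = df_title["product"].to_numpy()
lookup = df_title["webshop"].to_numpy()

# Shape (1, n) compared with shape (n, 1) broadcasts to an (n, n) bool matrix
same_shop = lookup[None, :] == lookup[:, None]

# Convert True/False to 1/0 and label rows and columns with the product names
result = pd.DataFrame(same_shop.astype(int), index=products, columns=products)
print(result)
```

Keeping the matrix as bool and converting with astype(int) only at the end is a design choice here; if 0/1 integers are not strictly required, the boolean matrix can be passed to pd.DataFrame directly.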