Create 'correlation matrix' for two lists to check if the values have something in common

Time:12-04

Could someone please help me out with the following?

I have one dataframe with two columns: products and webshops (n x 2) with n products. Now I would like to obtain a binary (n x n) matrix with all products listed as the indices and all products listed as the column names. Then each cell should contain a 1 or 0 denoting whether the product in the index and column name came from the same webshop.

The following code is returning what I would like to achieve.

import numpy as np
import pandas as pd

dist = np.empty((len(df_title), len(df_title)), int)

for i in range(0, len(df_title)):
    for j in range(0, len(df_title)):
        boolean = df_title.values[i][1] == df_title.values[j][1]
        dist[i][j] = boolean
df = pd.DataFrame(dist)

However, this code already takes a significant amount of time for n = 1624, so I was wondering whether someone has an idea for a faster algorithm.

Thanks!

CodePudding user response:

It seems you only need the element at position 1 of every row (the webshop), so storing that column in a temporary variable makes the lookup easier. It also avoids accessing df_title.values inside the loop, which adds overhead on every iteration:

lookup = df_title.values[:, 1]

Since you want to interpret the result as a boolean matrix, you should probably specify dtype=bool (1 byte per element) instead of dtype=int (typically 8 bytes per element), which cuts memory consumption by a factor of 8.

dist = np.empty((len(df_title), len(df_title)), dtype=bool)
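To make the memory difference concrete, here is a small sketch that compares the two dtypes for the n = 1624 from the question. It uses np.int64 explicitly, which is an assumption about what dtype=int resolves to on the asker's platform.

```python
import numpy as np

n = 1624  # number of products mentioned in the question

as_bool = np.empty((n, n), dtype=bool)
# Explicit 64-bit integers; plain `int` may map to a different width
# on some platforms, so this is an assumption for the comparison.
as_int64 = np.empty((n, n), dtype=np.int64)

print(as_bool.nbytes)                      # 2637376
print(as_int64.nbytes)                     # 21099008
print(as_int64.nbytes // as_bool.nbytes)   # 8
```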

Your matrix is symmetric about its diagonal, so you only need to compute half of it and mirror each result. The diagonal itself is always True, because every product shares a webshop with itself.

lookup = df_title.values[:, 1]
n = len(lookup)
dist = np.empty((n, n), dtype=bool)

for i in range(n):
    # diagonal: a product always shares a webshop with itself
    dist[i, i] = True
    # only visit the upper triangle (j > i) and mirror each result
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = lookup[i] == lookup[j]
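As a sanity check, the following sketch compares a half-matrix fill against a plain comparison of every pair on made-up data. The helper names same_shop_half and same_shop_full and the shop labels are hypothetical and only serve the illustration.

```python
import numpy as np

def same_shop_half(lookup):
    """Fill only the upper triangle and mirror it (hypothetical helper)."""
    n = len(lookup)
    dist = np.empty((n, n), dtype=bool)
    for i in range(n):
        dist[i, i] = True
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = lookup[i] == lookup[j]
    return dist

def same_shop_full(lookup):
    """Reference version: compare every pair (i, j) explicitly."""
    n = len(lookup)
    dist = np.empty((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            dist[i, j] = lookup[i] == lookup[j]
    return dist

# Made-up webshop labels, one per product
lookup = np.array(["shop_a", "shop_b", "shop_a", "shop_c"], dtype=object)
half = same_shop_half(lookup)
full = same_shop_full(lookup)
print(np.array_equal(half, full))  # True
```

The half-matrix version performs n(n-1)/2 comparisons instead of n², which roughly halves the work but keeps the loop in Python.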

With NumPy broadcasting you can reduce all of that to a single line of code that is orders of magnitude faster than the double for-loop solution:

lookup = df_title.values[:, 1]
dist = lookup[None, :] == lookup[:, None]
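For completeness, here is a minimal end-to-end sketch of the broadcasting approach that also labels the rows and columns with the product names and converts the result to 1/0, as the question asks. The sample data and the column names "product" and "webshop" are assumptions, since the original frame is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real df_title has n = 1624 rows
df_title = pd.DataFrame({
    "product": ["tv_a", "tv_b", "tv_c", "tv_d"],
    "webshop": ["shop_x", "shop_y", "shop_x", "shop_z"],
})

lookup = df_title["webshop"].to_numpy()

# Broadcasting a (1, n) row against an (n, 1) column compares every pair
same_shop = lookup[None, :] == lookup[:, None]   # (n, n) boolean array

# Label both axes with the product names and store 1/0 instead of True/False
products = df_title["product"].tolist()
result = pd.DataFrame(same_shop.astype(int), index=products, columns=products)
print(result)
```

If True/False is acceptable, dropping the astype(int) call keeps the matrix at one byte per element.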