Create 'correlation matrix' for two lists to check if the values have something in common-CodePudding

Could someone please help me out with the following?

I have one dataframe with two columns: products and webshops (n x 2) with n products. Now I would like to obtain a binary (n x n) matrix with all products listed as the indices and all products listed as the column names. Then each cell should contain a 1 or 0 denoting whether the product in the index and column name came from the same webshop.

The following code is returning what I would like to achieve.

dist = np.empty((len(df_title), len(df_title)), int)

for i in range(0,len(df_title)):
    for j in range(0,len(df_title)):
            boolean = df_title.values[i][1] == df_title.values[j][1]
            dist[i][j] = boolean  
df = pd.DataFrame(dist)

However, this code takes quite a significant time already for n = 1624. Therefore I was wondering if someone would have an idea for a faster algorithm.

Thanks!

CodePudding user response：

It seems like you're only interested in the element at position 1 for every column anyways, so creating a temp-variable for easier lookup could help:

lookup = df_title.values[:, 1]

Also since you want to interpret the resulting matrix as bool-matrix, you should probably specify dtype=bool (1 byte per field) instead of dtype=int (8 bytes per field), which also cuts down memory consumption by 8.

dist = np.empty((len(df_title), len(df_title)), dtype=bool)

Your matrix will be symmetric along the diagonal anyways, so you only need to compute "half" of the matrix, also if i == j we know the corresponding field in the matrix should be True.

lookup = df_title.values[:, 1]
dist = np.empty((len(df_title), len(df_title)), dtype=bool)

for i in range(len(df_title)):
    for j in range(len(df_title)):
        if i == j:
            # diagonal
            dist[i, j] = True
        else:
            # symmetric along diagonal
            dist[i, j] = dist[j, i] = lookup[i] == lookup[j]

Also using numpy-broadcasting you could actually transform all of that into a single line of code, that is orders of magnitude faster than the double-for-loop solution:

lookup = df_title.values[:, 1]
dist = lookup[None, :] == lookup[:, None]