Performance issue with Python - how to find and work on matching records from data in long format


Suppose you have a data frame AllData looking like:

     ID M   P
1  ID_4 D 4.9
2  ID_1 A 5.0
3  ID_3 D 5.0
4  ID_5 D 5.1
5  ID_2 B 5.2
6  ID_1 B 5.5
7  ID_3 A 6.0
8  ID_1 C 6.0
9  ID_4 B 6.3
10 ID_4 A 6.5
11 ID_5 A 7.0
12 ID_3 C 7.0
13 ID_5 C 8.0

For each pair of distinct values of M (order does not matter), you want to find the records with matching ID's, and calculate: the number of such records, the min and max of their P values, and the Spearman correlation coefficient between their P's.

E.g. if you cross-tabulate the above in R:

xtabs(P ~ ID + M, AllData, sparse = T)

      M
ID       A   B C   D
  ID_1 5.0 5.5 6 .  
  ID_2 .   5.2 . .  
  ID_3 6.0 .   7 5.0
  ID_4 6.5 6.3 . 4.9
  ID_5 7.0 .   8 5.1

you see that A and D have 3 ID's in common, with P going from 6.0 to 7.0 for A, and from 4.9 to 5.1 for D; the Spearman correlation coefficient is 0.5:

cor(c(6.0, 6.5, 7.0), c(5.0, 4.9, 5.1), method = 'spearman')
#[1] 0.5
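
For reference, the same cross-tabulation and the 0.5 check can be reproduced in Python (a minimal sketch using pandas and scipy, assuming AllData is the pandas data frame defined further down):

import pandas as pd
from scipy.stats import spearmanr

# wide, xtabs-like view: rows = ID, columns = M, values = P (NaN where missing)
wide = AllData.pivot(index = 'ID', columns = 'M', values = 'P')

# A vs D, restricted to the ID's they have in common
common = wide[['A', 'D']].dropna()
print(spearmanr(common['A'], common['D'])[0])
# 0.5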

I had coded this kind of calculation in R and was OK with the performance.
Now I have tried to reproduce the same thing in Python, and I suspect I did something wrong, because it is something like 10 times slower.
Most often Python is at least as fast as R, if not faster, hence my puzzlement.

See below the code examples with timings.
Keeping in mind that the real data I am using is much sparser than this, I can report that the difference in performance is similar (about 2.5 min in R, about 20-25 min in Python).

Would anyone be able to suggest improvements for the python version?

My impression is that the part where the ID's in common are determined is the critical one.
In that respect, I have already tried a few things, such as completely bypassing the explicit identification of the common records, i.e. building a sparse cross-tabulated matrix and using the nan_policy = 'omit' option of spearmanr, or using np.isin, np.intersect1d and similar instead of the dict-keys set intersection used here (a rough sketch of that variant follows this paragraph).
None of these approaches improved the performance.
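
The np.intersect1d variant was along these lines (a minimal, illustrative sketch on the A vs D example; idx1, idx2, p1, p2 are hypothetical stand-ins for the per-M index and P arrays built from the full data):

import numpy as np

# hypothetical per-M arrays: integer ID indices and their P values, aligned position by position
idx1 = np.array([1, 3, 4, 5])           # ID indices for M = 'A'
p1 = np.array([5.0, 6.0, 6.5, 7.0])     # matching P values for M = 'A'
idx2 = np.array([3, 4, 5])              # ID indices for M = 'D'
p2 = np.array([5.0, 4.9, 5.1])          # matching P values for M = 'D'

# ID indices in common, plus their positions in idx1 and idx2
common, pos1, pos2 = np.intersect1d(idx1, idx2, return_indices = True)
a1 = p1[pos1]   # array([6. , 6.5, 7. ])
a2 = p2[pos2]   # array([5. , 4.9, 5.1])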

R code:

AllData = data.frame(
  "ID" = paste0('ID_', c(1,3,4,5,1,2,4,1,3,5,3,4,5)),
  "M" = c('A','A','A','A','B','B','B','C','C','C','D','D','D'),
  "P" = c(5,6,6.5,7,5.5,5.2,6.3,6,7,8,5,4.9,5.1)
)

AllData = AllData[order(AllData$P), ]

rownames(AllData) <- 1:(dim(AllData)[1])

#xtabs(P ~ ID + M, AllData)

# Replace ID with an index of its sorted elements
AllData['ID'] <- as.numeric(factor(AllData[['ID']]))

# Replace M with an index of its elements
M.u <- unique(AllData[['M']])
M.u.N <- length(M.u)
M.u <- sort(M.u)
AllData[['M']] <- as.numeric(factor(AllData[['M']], levels = as.character(M.u)))

# Sort AllData by the numeric indices obtained from M and ID
AllData = AllData[order(AllData$M, AllData$ID),]

# Create a matrix of ID indices and matching P values, for each M.u
M.ind.pos <- unname(c(1, 1 + cumsum(table(AllData[['M']]))))
ID.vs.M <- sapply(1:M.u.N,
                  FUN = function(ci) {
                    AllData[(M.ind.pos[ci]):(M.ind.pos[ci + 1] - 1), c('ID', 'P')]})

# Calculate Spearman correlation and stats

ts <- numeric(0)

for (xx in 1:7) {
  
  start_time <- Sys.time()
  
  for (x in 1:1000) {
    
    cor.out <- vector("list", M.u.N * (M.u.N - 1) / 2)
    r <- 0
    
    for (i in 1:(M.u.N - 1)) {
      #print(paste0(i, "/", (M.u.N - 1), ", ", M.u[i]))
      rvi <- ID.vs.M[,i]
      ri <- rvi[[1]]
      vi <- rvi[[2]]
      Ni <- length(vi)
      
      for (j in (i + 1):(M.u.N)) {
        rvj <- ID.vs.M[,j]
        rj <- rvj[[1]]
        vj <- rvj[[2]]
        Nj <- length(vj)
        
        ri_in_rj <- which(ri %in% rj)
        vi_ij <- vi[ri_in_rj]
        ri_ij <- ri[ri_in_rj]
        rj_in_ri <- which(rj %in% ri_ij)
        vj_ij <- vj[rj_in_ri]
        
        Nij <- length(ri_in_rj)
        
        Ni.frac <- round(Nij / Ni, 2)
        Nj.frac <- round(Nij / Nj, 2)
        
        vi.range <- as.numeric(c(NA, NA))
        vj.range <- as.numeric(c(NA, NA))
        vi.range.diff <- as.numeric(NA)
        vj.range.diff <- as.numeric(NA)
        cor.ij <- as.numeric(NA)
        
        if (Nij >= 1) {
          vi.range <- range(vi_ij)
          vj.range <- range(vj_ij)
          vi.range.diff <- diff(vi.range)
          vj.range.diff <- diff(vj.range)
        }
        
        if (Nij >= 3) {
          cor.ij <- cor(vi_ij, vj_ij, method = 'spearman')
        }
        
        # Add result to the list
        r <- r + 1
        cor.out[[r]] <- c(i, j, Ni, Nj, Nij, vi.range, vi.range.diff, vj.range, vj.range.diff, cor.ij)
      }
    }
  }
  end_time <- Sys.time()
  ts <- c(ts, as.numeric(end_time - start_time))
}

# time in s for 1000 runs = time in ms for 1 run
print(paste0("7 x 1000 repeats: ", mean(ts), " +/- ", sd(ts), " ms"))

result:

7 x 1000 repeats: 0.364144154957363 +/- 0.0251607071478284 ms

Python code:

import pandas as pd
import numpy as np
from scipy.stats import spearmanr

AllData = pd.DataFrame({
    "ID" : ['ID_'  str(i) for i in [1,3,4,5,1,2,4,1,3,5,3,4,5]],
    "M" : ['A','A','A','A','B','B','B','C','C','C','D','D','D'],
    "P" : [5,6,6.5,7,5.5,5.2,6.3,6,7,8,5,4.9,5.1]
})

AllData = AllData.sort_values('P').reset_index(drop = True)

# Make a numerical index for ID, as a dict
unique_IDs = AllData['ID'].unique()
index_vs_ID_dict = dict(zip(unique_IDs, range(len(unique_IDs))))

# Make a dict of (index, P) vs M
index_P_vs_M_dict = dict()

for M, ID, P in zip(AllData['M'].values, AllData['ID'].values, AllData['P'].values) :
    index = index_vs_ID_dict[ID]
    if M not in index_P_vs_M_dict :
        index_P_vs_M_dict[M] = [(index, P)]
    else :
        index_P_vs_M_dict[M].append((index, P))

# Turn each list of tuples in index_P_vs_M_dict into a dict of P vs index
for k in index_P_vs_M_dict :    
    index_P_vs_M_dict[k] = dict(index_P_vs_M_dict[k])

M_list = list(index_P_vs_M_dict.keys())

%%timeit

corr_table = pd.DataFrame()

for m, M1 in enumerate(M_list[:-1]) :  
  # Read the index_P dictionary of M1
  d1 = index_P_vs_M_dict[M1]
  # Extract the index of M1 as dict keys
  i1 = d1.keys()  
  N1 = len(i1)
  #print(M1, ", ", str(N1))

  for M2 in M_list[(m + 1):] :
    # Read the index_P dictionary of M2
    d2 = index_P_vs_M_dict[M2]
    # Extract the index of M2 as dict keys
    i2 = d2.keys()
    N2 = len(i2)

    # Find the indices in common between M1 and M2    
    i12 = (i1 & i2)
    N12 = len(i12)
    
    # Compute the srcc and stats
    srcc = pd.NA
    if (N12 > 2) :
      a1 = [d1[index] for index in i12]
      a2 = [d2[index] for index in i12]
      min_a1 = min(a1)
      max_a1 = max(a1)
      range_a1 = max_a1 - min_a1
      min_a2 = min(a2)
      max_a2 = max(a2)
      range_a2 = max_a2 - min_a2
      if (range_a1 > 0) & (range_a2 > 0) :
        srcc = spearmanr(a1, a2)[0]
        #print(M2 + ', # ID\'s in common = ' + str(N12) + ', SRCC = ' + str(srcc))
    else:
      min_a1 = pd.NA
      max_a1 = pd.NA
      range_a1 = pd.NA        
      min_a2 = pd.NA
      max_a2 = pd.NA
      range_a2 = pd.NA      
    
    if True:
      # Store the results in corr_table
      corr_table = pd.concat([corr_table, pd.DataFrame({
        "M_1" : M1,
        "M_2" : M2,
        "N_1" : N1,
        "N_2" : N2,
        "N_common_IDs" : N12,
        "min_1" : min_a1,
        "max_1" : max_a1,
        "range_1" : range_a1,
        "min_2" : min_a2,
        "max_2" : max_a2,
        "range_2" : range_a2,
        "Spearman_cc" : srcc
      }, index = [0])], axis = 0)

result:

4.73 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Looks like a bit more than 10 times slower than R, if I am not mistaken.

CodePudding user response:

I'm going to use a library of mine, convtools (GitHub), to build the grouper.

from itertools import combinations

from scipy.stats import spearmanr
from convtools import conversion as c


input_data = {
    "ID": ["ID_"   str(i) for i in [1, 3, 4, 5, 1, 2, 4, 1, 3, 5, 3, 4, 5]],
    "M": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "P": [5, 6, 6.5, 7, 5.5, 5.2, 6.3, 6, 7, 8, 5, 4.9, 5.1],
}
# let's rewrite as list of dicts
prepared_data = [
    dict(zip(("ID", "M", "P"), tuple_))
    for tuple_ in zip(input_data["ID"], input_data["M"], input_data["P"])
]

# build the grouper
converter = (
    c.group_by(c.item("M"))
    .aggregate(
        {
            "M": c.item("M"),
            # this assumes you only have a single value per ID, otherwise
            # replace with DictArray and handle accordingly
            "id_to_p_values": c.ReduceFuncs.Dict(c.item("ID"), c.item("P")),
        }
    )
    .gen_converter()
)

# In [18]: converter(prepared_data)
# Out[18]:
# [{'M': 'A', 'id_to_p_values': {'ID_1': 5, 'ID_3': 6, 'ID_4': 6.5, 'ID_5': 7}},
#  {'M': 'B', 'id_to_p_values': {'ID_1': 5.5, 'ID_2': 5.2, 'ID_4': 6.3}},
#  {'M': 'C', 'id_to_p_values': {'ID_1': 6, 'ID_3': 7, 'ID_5': 8}},
#  {'M': 'D', 'id_to_p_values': {'ID_3': 5, 'ID_4': 4.9, 'ID_5': 5.1}}]


def calculate_pair(left, right):
    left_id_to_p_values = left["id_to_p_values"]
    right_id_to_p_values = right["id_to_p_values"]
    common_keys = set(left_id_to_p_values).intersection(right_id_to_p_values)
    left_p_values = [left_id_to_p_values[key] for key in common_keys]
    right_p_values = [right_id_to_p_values[key] for key in common_keys]

    min_1 = min(left_p_values)
    min_2 = min(right_p_values)
    max_1 = max(left_p_values)
    max_2 = max(right_p_values)
    return {
        "M1": left["M"],
        "M2": right["M"],
        "N1": len(left_id_to_p_values),
        "N2": len(right_id_to_p_values),
        "min_1": min_1,
        "min_2": min_2,
        "max_1": max_1,
        "max_2": max_2,
        "range_1": max_1 - min_1,
        "range_2": max_2 - min_2,
        "N_common_IDs": len(common_keys),
        "srcc": spearmanr(left_p_values, right_p_values)[0],
    }


%%timeit
results = [
    calculate_pair(left, right)
    for left, right in combinations(converter(prepared_data), 2)
]
# 457 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# results == [
#     {'M1': 'A', 'M2': 'B', 'N1': 4, 'N2': 3, 'min_1': 5, 'min_2': 5.5, 'max_1': 6.5, 'max_2': 6.3, 'range_1': 1.5, 'range_2': 0.7999999999999998, 'N_common_IDs': 2, 'srcc': 0.9999999999999999},
#     {'M1': 'A', 'M2': 'C', 'N1': 4, 'N2': 3, 'min_1': 5, 'min_2': 6, 'max_1': 7, 'max_2': 8, 'range_1': 2, 'range_2': 2, 'N_common_IDs': 3, 'srcc': 1.0},
#     {'M1': 'A', 'M2': 'D', 'N1': 4, 'N2': 3, 'min_1': 6, 'min_2': 4.9, 'max_1': 7, 'max_2': 5.1, 'range_1': 1, 'range_2': 0.1999999999999993, 'N_common_IDs': 3, 'srcc': 0.5},
#     {'M1': 'B', 'M2': 'C', 'N1': 3, 'N2': 3, 'min_1': 5.5, 'min_2': 6, 'max_1': 5.5, 'max_2': 6, 'range_1': 0.0, 'range_2': 0, 'N_common_IDs': 1, 'srcc': nan},
#     {'M1': 'B', 'M2': 'D', 'N1': 3, 'N2': 3, 'min_1': 6.3, 'min_2': 4.9, 'max_1': 6.3, 'max_2': 4.9, 'range_1': 0.0, 'range_2': 0.0, 'N_common_IDs': 1, 'srcc': nan},
#     {'M1': 'C', 'M2': 'D', 'N1': 3, 'N2': 3, 'min_1': 7, 'min_2': 5, 'max_1': 8, 'max_2': 5.1, 'range_1': 1, 'range_2': 0.09999999999999964, 'N_common_IDs': 2, 'srcc': 0.9999999999999999}
# ]

Unfortunately we cannot compare my 457 µs to your 4.73 ms, as you timed only a part of your data processing procedure (and of course we have different hardware).
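
If you want an apples-to-apples comparison, one option is to time the whole pipeline (grouping plus the pairwise loop) on both sides; a minimal sketch for the Python side, reusing the snippets above:

import time

start = time.perf_counter()
grouped = converter(prepared_data)
results = [
    calculate_pair(left, right)
    for left, right in combinations(grouped, 2)
]
elapsed = time.perf_counter() - start
print("full pipeline: {:.3f} ms".format(elapsed * 1000))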
