The code was initially in R, but since R does not handle large datasets well, I converted it to Python and ported it to Google Colab. Even on Colab it took very long, and I never actually saw it finish running, even after 8 hours. I also added more break statements to avoid unnecessary iterations.
The dataset has around 50,000 unique timestamps and 40,000 unique IDs. It is in the format ['time', 'id', 'x-coordinate', 'y-coordinate'], a very clear-cut passenger trajectory dataset.
What the code is trying to do is extract all pairs of IDs that are 2 meters or less apart from each other in the same time frame.
Please let me know if there are ways to optimize this.
import math
import pandas as pd

# R, beta, my_data and decision() are defined earlier in the script
i = 0
y = pd.DataFrame(columns=['source', 'dest'])  # empty contact network df
infectedGrp = [824, 11648, 23468]
while i < my_data.shape[0]:
    row1 = my_data.iloc[i]
    id1 = row1[1]
    time1 = row1[0]
    x1 = row1[2]
    y1 = row1[3]
    infected1 = my_data.iloc[i, 4]
    infectious1 = my_data.iloc[i, 5]
    #print(row1)
    #print(time1)
    for j in range(i + 1, my_data.shape[0]):
        row2 = my_data.iloc[j]
        id2 = row2[1]
        time2 = row2[0]
        x2 = row2[2]
        y2 = row2[3]
        infected2 = my_data.iloc[j, 4]
        infectious2 = my_data.iloc[j, 5]
        print(time2)
        if time2 != time1:
            i = i + 1
            print("diff time...breaking")
            break
        if (x2 > x1 + 2) or (x1 > x2 + 2):
            i = i + 1
            print("x more than 2...breaking")
            break
        if (y2 > y1 + 2) or (y1 > y2 + 2):
            i = i + 1
            print("y more than 2...breaking")
            break
        probability = 0
        distance = round(math.sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2)), 2)
        print(distance)
        print(infected1)
        print(infected2)
        if distance <= R:
            if infectious1 and not infected2:  # if one person is infectious and the other is not infected
                probability = (1 - beta) * (1 / R) * math.sqrt(R**2 - distance**2)
                print(probability)
                print("here")
                infected2 = decision(probability)
                numid2 = int(id2)  # update all entries for id2
                if infected2:
                    my_data.loc[my_data['id'] == numid2, 'infected'] = True
                    #my_data.iloc[j,7]=probability
            elif infectious2 and not infected1:
                infected1 = decision(probability)
                numid1 = int(id1)  # update all entries for id1
                if infected1:
                    my_data.loc[my_data['id'] == numid1, 'infected'] = True
                    #my_data.iloc[i,7]=probability
            inf1 = 'F'
            inf2 = 'F'
            if infected1:
                inf1 = 'T'
            if infected2:
                inf2 = 'T'
            print('prob ' + str(probability) + ' at time ' + str(time1))
            new_row = {'source': id1.astype(str) + ' ' + inf1, 'dest': id2.astype(str) + ' ' + inf2}
            y = y.append(new_row, ignore_index=True)
    i = i + 1
CodePudding user response:
It's hard to tell for sure, but a good guess is that this line is your biggest "sin":
y = y.append(new_row, ignore_index=True)
You should not append rows to a DataFrame in a loop: each call to append copies the entire frame, so building n rows this way costs O(n²).
Instead, aggregate the rows in a Python list and create the DataFrame from all of them once, after the loop:
y = []
while i < my_data.shape[0]:
    (...)
    y.append(new_row)
y = pd.DataFrame(y)
I also suggest using a line profiler to analyse which parts of the code are the bottlenecks.
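For example, with the line_profiler package (a minimal sketch; process_contacts is a hypothetical name for a function wrapping the loop above, and the package is assumed installed via pip install line_profiler):

from line_profiler import LineProfiler

def process_contacts(my_data):
    ...  # the while/for loop from the question, moved into a function

profiler = LineProfiler()
profiler.add_function(process_contacts)      # record per-line timings for this function
profiler.runcall(process_contacts, my_data)  # run it under the profiler
profiler.print_stats()                       # print hit counts and time spent per line

In a Colab/Jupyter notebook you can get the same report with the %load_ext line_profiler and %lprun magics.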
CodePudding user response:
You are using a nested loop to find rows that share the same time value. You can get a huge improvement by doing a groupby operation instead and then iterating through the groups.
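Here is a minimal sketch of that idea. It assumes my_data has columns named 'time', 'id', 'x' and 'y' (adapt the names to your frame) and uses scipy's cKDTree to find all pairs within 2 meters inside each time group, so rows from different timestamps are never compared and the bounding-box pre-checks are no longer needed:

import pandas as pd
from scipy.spatial import cKDTree

contacts = []  # collect plain dicts, build the DataFrame once at the end
for time, group in my_data.groupby('time'):
    coords = group[['x', 'y']].to_numpy()
    ids = group['id'].to_numpy()
    tree = cKDTree(coords)
    # query_pairs returns all index pairs (i, j), i < j, at distance <= r
    for i, j in tree.query_pairs(r=2.0):
        contacts.append({'time': time, 'source': ids[i], 'dest': ids[j]})

contact_df = pd.DataFrame(contacts)

This also folds in the advice from the previous answer: rows are accumulated in a list and the DataFrame is created once. Building one k-d tree per timestamp makes each group roughly O(n log n) instead of an O(n²) pairwise scan.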