The code was initially in R, but since R does not handle large datasets well, I converted it to Python and ported it to Google Colab. Even on Colab it took very long, and I never actually saw it finish running, even after 8 hours. I also added more break statements to avoid unnecessary iterations.
The dataset has around 50,000 unique timestamps and 40,000 unique IDs. It is in the format ['time', 'id', 'x-coordinate', 'y-coordinate'], a very clear-cut passenger trajectory dataset.
What the code is trying to do is extract all pairs of IDs that are 2 meters or less apart from each other in the same time frame.
Please let me know if there are ways to optimize this.
import math
import pandas as pd

# R, beta, my_data and decision() are defined earlier in the script
i = 0
y = pd.DataFrame(columns=['source', 'dest'])  # empty contact network df
infectedGrp = [824, 11648, 23468]
while i < my_data.shape[0]:
    row1 = my_data.iloc[i]
    id1 = row1[1]
    time1 = row1[0]
    x1 = row1[2]
    y1 = row1[3]
    infected1 = my_data.iloc[i, 4]
    infectious1 = my_data.iloc[i, 5]
    #print(row1)
    #print(time1)
    for j in range(i + 1, my_data.shape[0]):
        row2 = my_data.iloc[j]
        id2 = row2[1]
        time2 = row2[0]
        x2 = row2[2]
        y2 = row2[3]
        infected2 = my_data.iloc[j, 4]
        infectious2 = my_data.iloc[j, 5]
        print(time2)
        if time2 != time1:
            i = i + 1
            print("diff time...breaking")
            break
        if (x2 > x1 + 2) or (x1 > x2 + 2):
            i = i + 1
            print("x more than 2...breaking")
            break
        if (y2 > y1 + 2) or (y1 > y2 + 2):
            i = i + 1
            print("y more than 2...breaking")
            break
        probability = 0
        distance = round(math.sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2)), 2)
        print(distance)
        print(infected1)
        print(infected2)
        if distance <= R:
            if infectious1 and not infected2:  # if one person is infectious and the other is not infected
                probability = (1 - beta) * (1 / R) * math.sqrt(R**2 - distance**2)
                print(probability)
                print("here")
                infected2 = decision(probability)
                numid2 = int(id2)  # update all entries for id2
                if infected2:
                    my_data.loc[my_data['id'] == numid2, 'infected'] = True
                    #my_data.iloc[j,7]=probability
            elif infectious2 and not infected1:
                infected1 = decision(probability)
                numid1 = int(id1)  # update all entries for id1
                if infected1:
                    my_data.loc[my_data['id'] == numid1, 'infected'] = True
                    #my_data.iloc[i,7]=probability
            inf1 = 'F'
            inf2 = 'F'
            if infected1:
                inf1 = 'T'
            if infected2:
                inf2 = 'T'
            print('prob ' + str(probability) + ' at time ' + str(time1))
            new_row = {'source': id1.astype(str) + ' ' + inf1, 'dest': id2.astype(str) + ' ' + inf2}
            y = y.append(new_row, ignore_index=True)
    i = i + 1
CodePudding user response:
It's hard to tell for sure, but a good guess is that this line is your biggest "sin":
y = y.append(new_row, ignore_index=True)
You should not append rows to a DataFrame in a loop: each call to append copies the entire frame, so building n rows this way costs O(n²).
Instead, aggregate the rows in a Python list and create the DataFrame from all of them once, after the loop:
y = []
while i < my_data.shape[0]:
    (...)
    y.append(new_row)
y = pd.DataFrame(y)
I also suggest using a line profiler to analyse which parts of the code are the bottlenecks.
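For example, with the line_profiler package (a minimal sketch; process_contacts is a hypothetical name for a function wrapping the loop above, and the package is assumed installed via pip install line_profiler):

from line_profiler import LineProfiler

def process_contacts(my_data):
    ...  # the while/for loop from the question, moved into a function

profiler = LineProfiler()
profiler.add_function(process_contacts)      # record per-line timings for this function
profiler.runcall(process_contacts, my_data)  # run it under the profiler
profiler.print_stats()                       # print hit counts and time spent per line

In a Colab/Jupyter notebook you can get the same report with the %load_ext line_profiler and %lprun magics.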
CodePudding user response:
You are using a nested loop to find rows that share the same time value. You can get a huge improvement by doing a groupby operation instead and then iterating through the groups.
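Here is a minimal sketch of that idea. It assumes my_data has columns named 'time', 'id', 'x' and 'y' (adapt the names to your frame) and uses scipy's cKDTree to find all pairs within 2 meters inside each time group, so rows from different timestamps are never compared and the bounding-box pre-checks are no longer needed:

import pandas as pd
from scipy.spatial import cKDTree

contacts = []  # collect plain dicts, build the DataFrame once at the end
for time, group in my_data.groupby('time'):
    coords = group[['x', 'y']].to_numpy()
    ids = group['id'].to_numpy()
    tree = cKDTree(coords)
    # query_pairs returns all index pairs (i, j), i < j, at distance <= r
    for i, j in tree.query_pairs(r=2.0):
        contacts.append({'time': time, 'source': ids[i], 'dest': ids[j]})

contact_df = pd.DataFrame(contacts)

This also folds in the advice from the previous answer: rows are accumulated in a list and the DataFrame is created once. Building one k-d tree per timestamp makes each group roughly O(n log n) instead of an O(n²) pairwise scan.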