How to speed up this Pandas for loop-CodePudding

I have the following data in Python:

list1=[[ENS_ID1,ENS_ID2,ENS_ID3], [ENS_ID10,ENS_ID24,ENS_ID30] , ....]

mapping (a dataframe where in the first column I have an Ensemble gene ID and in the second column the corresponding MGI gene ID)

ENS_ID	MGI_ID
ENS_ID1	MGI_ID1
ENS_ID2	MGI_ID2

I'm trying to obtain another list of lists where instead of the ENS_ID I have the MGI_ID. To map the IDs I'm using a for cycle nested inside another one, but obviously, it's really slow as an approach. How can I speed it up? Here's the code:

for l in ens_lists:
  mgi = []
  for i in l:
      mgi.append(mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0])
  mgi_lists.append(mgi)

CodePudding user response：

As a quick solution you can try using listcomp instead of append, which should be faster:

mgi_lists = [[mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0] for i in l] for l in ens_lists]

Some explanations of why listcomp is faster are here

CodePudding user response：

The best solution is to create a fast data structure with only the lookup values, I mean a key/value, a dict can be very fast. After that, you must walk on the inputs and create the lookup-ed version.

import pandas as pd

list1=[['ENS_ID1','ENS_ID2','ENS_ID3'], ['ENS_ID10','ENS_ID3','ENS_ID2'] ] 

mapping = pd.DataFrame({'ENS_ID':['ENS_ID1','ENS_ID2','ENS_ID3','ENS_ID10'], 'MGI_ID':['MGI_ID1','MGI_ID2','MGI_ID2','MGI_ID10']})
    
lookup = dict(mapping[['ENS_ID','MGI_ID']].values)

# This is superfast
mapped_list = []
for l in list1:
    mapped_list.append([lookup[v] for v in l])

print(mapped_list)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID2'], ['MGI_ID10', 'MGI_ID2', 'MGI_ID2']]

ps: please correct the question with working code.