I have the following data in Python:
list1=[[ENS_ID1,ENS_ID2,ENS_ID3], [ENS_ID10,ENS_ID24,ENS_ID30] , ....]
mapping (a DataFrame where the first column holds an Ensembl gene ID and the second column holds the corresponding MGI gene ID)
ENS_ID | MGI_ID |
---|---|
ENS_ID1 | MGI_ID1 |
ENS_ID2 | MGI_ID2 |
I'm trying to obtain another list of lists that contains the MGI_ID in place of each ENS_ID. To map the IDs I'm using one for loop nested inside another, but this approach is really slow because every lookup filters the whole DataFrame. How can I speed it up? Here's the code:
mgi_lists = []
for l in ens_lists:
    mgi = []
    for i in l:
        # Filter the DataFrame for this ID, then take the matching MGI_ID
        mgi.append(mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0])
    mgi_lists.append(mgi)
CodePudding user response:
As a quick fix, you can try a nested list comprehension instead of repeated append calls, which should be faster:
mgi_lists = [[mapping['MGI_ID'][mapping[mapping['ENSEMBL_ID']==i].index].values[0] for i in l] for l in ens_lists]
Some explanations of why listcomp is faster are here
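The comprehension above still filters the whole DataFrame once per ID. As a rough sketch of one way to reduce that cost while keeping the same comprehension, the mapping can be indexed by the Ensembl ID once up front. This assumes the ENSEMBL_ID values are unique; the sample data and the name ens_to_mgi are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; column names follow the question's code.
mapping = pd.DataFrame({
    "ENSEMBL_ID": ["ENS_ID1", "ENS_ID2", "ENS_ID3"],
    "MGI_ID": ["MGI_ID1", "MGI_ID2", "MGI_ID3"],
})
ens_lists = [["ENS_ID1", "ENS_ID2"], ["ENS_ID3", "ENS_ID1"]]

# Build the label index once instead of filtering the frame for every ID.
# Assumption: ENSEMBL_ID values are unique, so each lookup returns a scalar.
ens_to_mgi = mapping.set_index("ENSEMBL_ID")["MGI_ID"]

# Same nested comprehension, but each lookup goes through the index.
mgi_lists = [[ens_to_mgi[i] for i in l] for l in ens_lists]
print(mgi_lists)
# [['MGI_ID1', 'MGI_ID2'], ['MGI_ID3', 'MGI_ID1']]
```

An ID that is missing from the index raises a KeyError here, so this only fits data where every ID is known to be mapped.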
CodePudding user response:
The best solution is to build a fast lookup structure that holds only the key/value pairs you need; a dict makes each lookup very fast. After that, walk over the input lists and build the mapped version.
import pandas as pd
list1=[['ENS_ID1','ENS_ID2','ENS_ID3'], ['ENS_ID10','ENS_ID3','ENS_ID2'] ]
mapping = pd.DataFrame({'ENS_ID':['ENS_ID1','ENS_ID2','ENS_ID3','ENS_ID10'], 'MGI_ID':['MGI_ID1','MGI_ID2','MGI_ID2','MGI_ID10']})
lookup = dict(mapping[['ENS_ID','MGI_ID']].values)
# This is superfast
mapped_list = []
for l in list1:
    mapped_list.append([lookup[v] for v in l])
print(mapped_list)
# [['MGI_ID1', 'MGI_ID2', 'MGI_ID2'], ['MGI_ID10', 'MGI_ID2', 'MGI_ID2']]
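One caveat with indexing the dict directly: lookup[v] raises a KeyError for any Ensembl ID that has no row in the mapping. If that can happen in your data, a minimal sketch using dict.get could look like this. The small hand-written dict stands in for the one built from the DataFrame, and ENS_ID99 is a made-up unmapped ID:

```python
# Stand-in for the dict built from the mapping DataFrame.
lookup = {"ENS_ID1": "MGI_ID1", "ENS_ID2": "MGI_ID2"}

# ENS_ID99 is a hypothetical ID that is deliberately absent from the mapping.
list1 = [["ENS_ID1", "ENS_ID99"], ["ENS_ID2"]]

# dict.get returns the default (None here) instead of raising KeyError.
mapped_list = [[lookup.get(v) for v in l] for l in list1]
print(mapped_list)
# [['MGI_ID1', None], ['MGI_ID2']]
```

Whether to keep None, substitute the original ID with lookup.get(v, v), or drop unmapped entries depends on what the downstream analysis expects.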
PS: please update the question with working code.