I have 2 dataframes like these:
df1
Alias seq_RNA
Hsa-Mir-133-P2-v1-5p AGCTGGTAAAATGGAACCAAATC
Hsa-Mir-143-P1-v2-3p TTGGTCCCCTTCAACCAGCTGT
Hsa-Mir-183-P1-v1-3p TTTGGTCCCCTTCAACCAGCTGT
Hsa-Mir-490-3p CAACCTGGAGGACTCCATGCTGT
Hsa-Mir-499-5p TTAAGACTTGCAGTGATGTTTA
Hsa-Mir-163-P2-v2-3p TTGGTCCCCTTCAACCAGCTA
Hsa-Mir-122-as-5p TTTAGTGTGATAATGGCGTTTG
Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA
df2
Alias MiRBase_ID
Hsa-Mir-4-P1 hsa-let-133
Hsa-Let-7-P1c hsa-let-7c
Hsa-Let-7-P1d hsa-let-7a-2
Hsa-Mir-183-P1 hsa-let-183
Hsa-Mir-122-as hsa-let-122
Hsa-Let-7-P2a3 hsa-let-7f-2
I want to generate a new df3 where if there is a substring match of one of the alias between df1 and df2 it create a new column adding the corresponding MiRBase_ID, like this:
Alias seq_RNA MiRBase_ID
Hsa-Mir-133-P2-v1-5p AGCTGGTAAAATGGAACCAAATC na
Hsa-Mir-143-P1-v2-3p TTGGTCCCCTTCAACCAGCTGT na
Hsa-Mir-183-P1-v1-3p TTTGGTCCCCTTCAACCAGCTGT hsa-let-183
Hsa-Mir-490-3p CAACCTGGAGGACTCCATGCTGT na
Hsa-Mir-499-5p TTAAGACTTGCAGTGATGTTTA na
Hsa-Mir-163-P2-v2-3p TTGGTCCCCTTCAACCAGCTA na
Hsa-Mir-122-as-5p TTTAGTGTGATAATGGCGTTTG hsa-let-122
Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA na
So basically when it finds a match or even a partial match (but the best match possible) of one alias in df1 in df2 it takes the MiRBase_ID and put it in the new df3. otherwhise put na if no match is found. there is no order in the 2 dataframes so each of the best match needs to be searched in the whole Alias column and df1 and df2 are of different nr of rows.
CodePudding user response:
Use str.extract
to get the substring and merge
:
import re
pattern = '|'.join(df2['Alias'].map(re.escape))
out = df1.merge(df2.drop(columns='Alias'),
left_on=df1['Alias'].str.extract(f'({pattern})', expand=False),
right_on=df2['Alias'], how='left').drop(columns='key_0')
Output:
Alias seq_RNA MiRBase_ID
0 Hsa-Mir-133-P2-v1-5p AGCTGGTAAAATGGAACCAAATC NaN
1 Hsa-Mir-143-P1-v2-3p TTGGTCCCCTTCAACCAGCTGT NaN
2 Hsa-Mir-183-P1-v1-3p TTTGGTCCCCTTCAACCAGCTGT hsa-let-183
3 Hsa-Mir-490-3p CAACCTGGAGGACTCCATGCTGT NaN
4 Hsa-Mir-499-5p TTAAGACTTGCAGTGATGTTTA NaN
5 Hsa-Mir-163-P2-v2-3p TTGGTCCCCTTCAACCAGCTA NaN
6 Hsa-Mir-122-as-5p TTTAGTGTGATAATGGCGTTTG hsa-let-122
7 Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA NaN