I want to extract the subsequences indicated by the first and last locations in data frame 'B'. The algorithm that I came up with is:
- Identify the rows of B that fall in the locations of A
- Find the relative position of the locations (i.e. shift the locations to make them start from 0)
- Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is runtime. I require an alternative approach to compile the code faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG','ACCACACCCCAAATGTGT'
],'first':[1,100], 'last':[12,117]})
B = pd.DataFrame({'first': [3,8,105,115], 'last':[5,12,111,117]})
CodePudding user response:
One solution could be as follows:
out = pd.merge_asof(B, A, on=['last'], direction='forward',
suffixes=('','_y'))
out.loc[:,['first','last']] = \
out.loc[:,['first','last']].sub(out.first_y, axis=0)
out = out.assign(sequences=out.apply(lambda row:
row['first.sequence'][row['first']:row['last'] 1],
axis=1)).drop(['first.sequence','first_y'], axis=1)
out.update(B)
print(out)
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
- First, use
df.merge_asof
to matchfirst
values fromB
withfirst
values fromA
. I.e.3, 8
will match with1
, and105, 115
will match with100
. Now we know which string (sequence) needs splitting and we also know where the string starts, e.g. at index1
or100
instead of a normal0
. - We use this last bit of information to find out where the string slice should start and end. So, we do
out.loc[:,['first','last']].sub(out.first_y, axis=0)
. E.g. we "reset"3
to2
(minus 1) and105
to5
(minus 100). - Now, we can use
df.apply
to get the string slices for each sequence, essentially looping over each row. (if your slices would have started and ended at the same indices, we could have usedSeries.str.slice
instead. - Finally, we assign the result to
out
(as colsequences
), drop the cols we no longer need, and we usedf.update
to "reset" the columnsfirst
andlast
.