I have a (very large) pandas dataframe like the following:
Sequence |
---|
AAAAAAAAAAAAAAAAAAAAAAAAA |
AAAAAAAAAAAAAAAAAAAAAAAAC |
AAAAAAAAAAAAAAAAAAAAAAAAG |
AAAAAAAAAAAAAAAAAAAAAAAAT |
AAAACAGAAGGTGTCCCAATACTAT |
AAAACAGATCTCGGCAGATTGGATG |
AAAACAGATCTCGGTAGACTGGACG |
And I want to remove rows where the fraction of A is greater than 0.80. Here is my code:
import difflib

sequences = file[['Sequence']]
seq_A = 'A' * 25
for row in range(len(file)):
    par1 = file.iloc[row, 0]
    # compare sequence with homopolymer and check ratio of match
    ratioA = difflib.SequenceMatcher(None, par1, seq_A).ratio()
    if ratioA >= 0.80:
        sequences.drop(row, axis=0, inplace=True)
        # lista.append(row)
But when I instead count the matching rows by appending their indices to a new list (without deleting any rows), the number of indices does not match the number of rows that were deleted. Thank you very much!
Answer:
You should generally avoid loops with pandas. Here is how you can do it:
df.loc[df['Sequence'].str.count('A') / df['Sequence'].str.len() <= 0.8]
produces:
Sequence
4 AAAACAGAAGGTGTCCCAATACTAT
5 AAAACAGATCTCGGCAGATTGGATG
6 AAAACAGATCTCGGTAGACTGGACG
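For anyone who wants to try this end to end, here is a minimal self-contained sketch built on the sample rows from the question. The names `frac_a`, `keep`, `filtered`, and `n_removed` are only illustrative, and the `reset_index` call is an optional extra that is not part of the one-liner above. It also counts the removed rows from the same boolean mask, so the count and the filtering cannot disagree.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Sequence": [
        "AAAAAAAAAAAAAAAAAAAAAAAAA",
        "AAAAAAAAAAAAAAAAAAAAAAAAC",
        "AAAAAAAAAAAAAAAAAAAAAAAAG",
        "AAAAAAAAAAAAAAAAAAAAAAAAT",
        "AAAACAGAAGGTGTCCCAATACTAT",
        "AAAACAGATCTCGGCAGATTGGATG",
        "AAAACAGATCTCGGTAGACTGGACG",
    ]
})

# Fraction of 'A' characters in each sequence
frac_a = df["Sequence"].str.count("A") / df["Sequence"].str.len()

# Boolean mask: True for rows to keep (fraction of A at most 0.8)
keep = frac_a <= 0.8

# Filter, then renumber the rows from 0 (optional)
filtered = df.loc[keep].reset_index(drop=True)

# Number of rows removed, taken from the same mask
n_removed = int((~keep).sum())

print(filtered)
print("rows removed:", n_removed)
```

Note that this measures the literal share of `A` characters, which is not necessarily the same number that `difflib.SequenceMatcher(...).ratio()` returns, so the two approaches may flag different rows.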