I have the following df. The number of observations run upto hundreds of millions. I have a series that consists of a large number (around 15 million) of addresses. I want to extract rows if either from or to address can be found in the series.
from_address to_address value block_timestamp
0 0xcf1cc6s 0x272d9fa 1.01485 2018-06-03 09:13:22 00:00
1 0xbfb92c3 0x24d941b 3.097 2018-06-03 09:13:22 00:00
2 0x58076fd 0x5b89cb8 1.9089 2018-06-03 09:13:22 00:00
3 0x0000000 0xbe80455 7.71195 2018-06-03 09:13:22 00:00
4 0x0000000 0x334f5b0 7.71133 2018-06-03 09:13:22 00:00
5 0x0000000 0x9b54db8 1e 21 2018-06-03 09:13:22 00:00
6 0x84b06bd 0x2a0c0db 2.185 2018-06-03 10:55:46 00:00
7 0x59a5202 0x3122da2 1633277 2018-06-03 10:55:46 00:00
8 0x59a520a 0x02d9112 1.606 2018-06-03 10:55:46 00:00
9 0x5e57d43 0xd64d144 1.472 2018-06-03 01:02:14 00:00
My code:
series= list['address'].str.lower()
df_extracted = df[df['from_address'].isin(series)
| df['to_address'].isin(series)]
But somehow, when I look at the original df and look at all the rows which contains a specific address, i get 136 rows, whereas if i do the same with df_extracted, i only get 12. So when i am extracting rows using the above code, I am doing something wrong. Does anyone have any clue what I am doing wrong?
CodePudding user response:
This worked for me:
addresses = ['0x0000000', '0x2a0c0db']
df.query('from_address in @addresses or to_address in @addresses')
It's supposed to work the same way as your code so maybe check your dataframe just before you apply this operation.
The result I'm getting:
from_add to_add value block_timestamp
index
3 0x0000000 0xbe80455 7.71195 2018-06-03 09:13:22 00:00
4 0x0000000 0x334f5b0 7.71133 2018-06-03 09:13:22 00:00
5 0x0000000 0x9b54db8 1e 21 2018-06-03 09:13:22 00:00
6 0x84b06bd 0x2a0c0db 2.185 2018-06-03 10:55:46 00:00