I have a CSV file with 15,000 MAC addresses, and I would like to find the vendor for each of these MAC addresses.
I have a JSON file that maps each MAC prefix to a vendor name, like this:
oui.json
{
"VendorMapping": [
{
"_mac_prefix": "00:00:00",
"_vendor_name": "XEROX CORPORATION"
},
{
"_mac_prefix": "00:00:01",
"_vendor_name": "XEROX CORPORATION"
},
{
"_mac_prefix": "00:00:02",
"_vendor_name": "XEROX CORPORATION"
},
{
"_mac_prefix": "00:00:03",
"_vendor_name": "XEROX CORPORATION"
},
{
"_mac_prefix": "00:00:04",
"_vendor_name": "XEROX CORPORATION"
},
....
I started my script with a for loop over the CSV and a nested for loop over the JSON to find the matching vendor for each MAC address in my CSV:
import json
import os
from time import time

start = time()

f = open("oui.json")
data = json.load(f)
file = open("data.csv")
content = file.readlines()[1:]

for line in content:
    mac = line.split(',')[1]
    print(mac)
    for oui in data["VendorMapping"]:
        if mac.upper().startswith(oui["_mac_prefix"]):
            print(oui["_vendor_name"])
            break

print(f'Total time: {time() - start}')
It took 49 seconds to get the vendor for all the MAC addresses, but I want to make it much faster.
For that, I decided to use asyncio, like this:
import json
import asyncio
import os
from time import time

start = time()

f = open('oui.json')
data = json.load(f)
file = open("api/data.csv")
content = file.readlines()[1:]

tasks = []

async def vendormapping(line):
    mac = line.split(',')[1]
    print(mac)
    for oui in data["VendorMapping"]:
        if mac.upper().startswith(oui["_mac_prefix"]):
            print(oui["_vendor_name"])
            break

async def main():
    for line in content:
        tasks.append(vendormapping(line))
    await asyncio.gather(*tasks)

asyncio.run(main())
print(f"All took {time() - start}")
I think I'm doing something wrong, because it still takes 39 seconds. I was expecting something much faster. Can someone please help me?
Thank you,
CodePudding user response:
asyncio is single-threaded: even though your code is async, nothing is actually processed in parallel. asyncio pays off when your workload spends a lot of time waiting on IO. In this scenario there is no waiting; it's pure CPU throughput.
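To illustrate the point: asyncio only wins when coroutines spend time waiting, because the event loop can switch to another coroutine during the wait. A minimal demonstration, with asyncio.sleep standing in for a real IO-bound call:

```python
import asyncio
from time import monotonic

async def fake_io(delay):
    # Stand-in for an IO-bound call; the event loop is free while we await.
    await asyncio.sleep(delay)
    return delay

async def main():
    start = monotonic()
    # The two waits overlap, so the total is ~0.2s rather than ~0.4s.
    results = await asyncio.gather(fake_io(0.2), fake_io(0.2))
    return monotonic() - start, results

elapsed, results = asyncio.run(main())
print(f"elapsed: {elapsed:.2f}s")
```

Your lookup loop never awaits anything, so the event loop has nothing to overlap and the coroutines effectively run one after another.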
I've refactored your code to use multiprocessing instead, which is better suited to raw throughput than asyncio. In my testing (using spoofed data, since I had to guess at yours) it cut the time from 15 seconds to 4.
import json
from multiprocessing import Pool
from time import time

start = time()
with open("oui.json") as f:
    data = json.load(f)
with open("data.csv") as file:
    content = file.readlines()[1:]

def vendormapping(line):
    mac = line.split(',')[1]
    print(mac)
    for oui in data["VendorMapping"]:
        if mac.upper().startswith(oui["_mac_prefix"]):
            print(oui["_vendor_name"])
            break

def main():
    with Pool() as p:
        p.map(vendormapping, content)

if __name__ == '__main__':
    main()
    print(f"All took {time() - start}")
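As a side note: since every prefix in oui.json is simply the first eight characters of the MAC, you can drop the inner loop entirely by building a dict once; each MAC then becomes a single O(1) lookup, which should make even the single-process version fast. A sketch using the sample data from the question:

```python
# Small sample mirroring the structure of oui.json from the question.
data = {"VendorMapping": [
    {"_mac_prefix": "00:00:00", "_vendor_name": "XEROX CORPORATION"},
    {"_mac_prefix": "00:00:01", "_vendor_name": "XEROX CORPORATION"},
]}

# Built once; every subsequent lookup is O(1) instead of a linear scan.
vendors = {e["_mac_prefix"]: e["_vendor_name"] for e in data["VendorMapping"]}

mac = "00:00:01:ab:cd:ef"
# An "xx:xx:xx" prefix is exactly the first 8 characters of the MAC.
vendor = vendors.get(mac.upper()[:8])
print(vendor)  # XEROX CORPORATION
```

With ~30,000 OUI entries, the original nested loop does up to 15,000 × 30,000 string comparisons; the dict version does 15,000 hash lookups.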