I've been working on a web scraper which runs asynchronously to substantially improve speed. Because of the way the scraper works, it writes two pieces of information for each URL it scrapes. These are stored in two dictionaries:
# {url1: [parent1, parent2, parent3], url2: [parent1, parent2, parent3]...}
relationships = dict()
# {url1: [page, redirect, content_type...], url2: [page, redirect, content_type...]}
url_results = dict()
The first dictionary, relationships, tracks each occurrence of the key URL within a parent URL. The second dictionary stores the resulting information I need from each URL scraped.
This is all necessary for how I'm handling async threading and memoization, and it works really well. The problem lies in writing these results to an Excel report. These dictionaries grow to thousands of keys, and merging them together is very slow.
The end result is that, for each parent value of a URL in the relationships dict, a line is written to my Excel sheet:
parent | url | page | redirect | content type | etc |
---|---|---|---|---|---|
parent1 | url1 | page | redirect | content_type | etc... |
parent1 | url2 | page | redirect | content_type | etc... |
parent1 | url3 | page | redirect | content_type | etc... |
parent2 | url1 | page | redirect | content_type | etc... |
parent2 | url2 | page | redirect | content_type | etc... |
parent2 | url3 | page | redirect | content_type | etc... |
Here's the method I'm currently using to merge these dictionaries:
parent_result_df = pd.DataFrame(columns=["Parent", "URL", "Page", "Redirect", "Content-type", "Source", "Status"])
location = 1
for key in RELATIONSHIPS:
    # result_arr = [Item, Redirect, Content_type, Source, Status]
    try:
        result_arr = URL_RESULTS[key]
    except KeyError:
        result_arr = [None, None, None, None, "Not Yet Accessed"]
    parents_urls = RELATIONSHIPS[key]
    for parent in parents_urls:
        parent_result_df.loc[location] = [parent, key, *result_arr]
        location += 1
return parent_result_df
I'm looking for a way to make this merge much faster, if at all possible. Writing this report currently takes almost as long as the scrape itself, and I think I'm missing something that could speed up this process.
I'd appreciate any help finding a better way to do this. Sorry for the information overload, and thank you greatly for any help!
CodePudding user response:
You can make two DataFrames from the dictionaries and then use the standard .merge():
import pandas as pd

relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"],
}
url_results = {
    "url1": ["page1", "redirect1", "content_type1"],
    "url2": ["page2", "redirect2", "content_type2"],
}

df1 = pd.DataFrame(
    [(k, p) for k, v in relationships.items() for p in v],
    columns=["url", "parent"],
)
df2 = pd.DataFrame(
    [(k, *v) for k, v in url_results.items()],
    columns=["url", "page", "redirect", "content_type"],
)
print(df1.merge(df2, on="url"))
Prints:
url parent page redirect content_type
0 url1 parent1 page1 redirect1 content_type1
1 url1 parent2 page1 redirect1 content_type1
2 url1 parent3 page1 redirect1 content_type1
3 url2 parent1 page2 redirect2 content_type2
4 url2 parent2 page2 redirect2 content_type2
5 url2 parent3 page2 redirect2 content_type2
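One caveat: an inner merge drops any URL that has no entry in url_results yet. If you want to keep the "Not Yet Accessed" fallback from your original loop, a left merge preserves those rows. This is a sketch with made-up sample data (url2 deliberately has no result):

```python
import pandas as pd

relationships = {"url1": ["parent1", "parent2"], "url2": ["parent1"]}
url_results = {"url1": ["page1", "redirect1", "content_type1"]}  # url2 not scraped yet

df1 = pd.DataFrame(
    [(k, p) for k, v in relationships.items() for p in v],
    columns=["url", "parent"],
)
df2 = pd.DataFrame(
    [(k, *v) for k, v in url_results.items()],
    columns=["url", "page", "redirect", "content_type"],
)

# how="left" keeps every (url, parent) pair even when the url has no result row
merged = df1.merge(df2, on="url", how="left")
merged["status"] = merged["page"].isna().map({True: "Not Yet Accessed", False: "OK"})
```

The unmatched url2 row comes through with NaN in the result columns, which the last line turns into a status flag.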
CodePudding user response:
Given something like:
relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"],
}
url_results = {
    "url1": ["page", "redirect", "content_type"],
    "url2": ["page", "redirect", "content_type"],
}
One could use a comprehension to load this data into pandas and then persist it to Excel, like:
import pandas

print(pandas.DataFrame(
    [parent, rk] + url_results.get(rk, [])
    for rk, rv in relationships.items()
    for parent in rv
))
Giving you a table:
0 1 2 3 4
0 parent1 url1 page redirect content_type
1 parent2 url1 page redirect content_type
2 parent3 url1 page redirect content_type
3 parent1 url2 page redirect content_type
4 parent2 url2 page redirect content_type
5 parent3 url2 page redirect content_type
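If you do want named columns and an Excel file, the same comprehension can feed the DataFrame constructor together with a columns= argument. The to_excel call is left commented here since it needs an Excel engine such as openpyxl installed; the file name is just an example:

```python
import pandas as pd

relationships = {"url1": ["parent1", "parent2"]}
url_results = {"url1": ["page", "redirect", "content_type"]}

df = pd.DataFrame(
    # same comprehension as above, now with proper column labels
    ([parent, rk] + url_results.get(rk, [])
     for rk, rv in relationships.items()
     for parent in rv),
    columns=["parent", "url", "page", "redirect", "content_type"],
)
# df.to_excel("report.xlsx", index=False)  # requires an engine like openpyxl
```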
But if the ultimate objective is just to persist it to a file, I would skip pandas and stream the results to a CSV:
import csv

with open("./out.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["parent", "url", "page", "redirect", "content_type"])
    writer.writerows(
        [parent, rk] + url_results.get(rk, [])
        for rk, rv in relationships.items()
        for parent in rv
    )
Giving you a CSV file like:
parent,url,page,redirect,content_type
parent1,url1,page,redirect,content_type
parent2,url1,page,redirect,content_type
parent3,url1,page,redirect,content_type
parent1,url2,page,redirect,content_type
parent2,url2,page,redirect,content_type
parent3,url2,page,redirect,content_type