I've been working on a web scraper which runs asynchronously to substantially improve speed. Because of the way the scraper works, it writes two pieces of information for each URL it scrapes. These are stored in two dictionaries:
# {url1: [parent1, parent2, parent3], url2: [parent1, parent2, parent3]...}
relationships = dict()
# {url1: [page, redirect, content_type...], url2: [page, redirect, content_type...]}
url_results = dict()
The first dictionary, relationships, tracks each occurrence of the key URL within a parent URL. The second dictionary stores the resulting information I need from each URL scraped.
This is all necessary for how I'm handling async threading and memoization, and it works really well. The problem lies in writing these results to an Excel report. These dictionaries grow to thousands of keys, and merging them together is very slow.
The end result is that, for each parent value of a URL in the relationships dict, a line is written to my Excel sheet:
parent | url | page | redirect | content type | etc |
---|---|---|---|---|---|
parent1 | url1 | page | redirect | content_type | etc... |
parent1 | url2 | page | redirect | content_type | etc... |
parent1 | url3 | page | redirect | content_type | etc... |
parent2 | url1 | page | redirect | content_type | etc... |
parent2 | url2 | page | redirect | content_type | etc... |
parent2 | url3 | page | redirect | content_type | etc... |
Here's the method I'm currently using to merge these dictionaries:
parent_result_df = pd.DataFrame(columns=["Parent", "URL", "Page", "Redirect", "Content-type", "Source", "Status"])
location = 1
for key in RELATIONSHIPS:
    # result_arr = [Item, Redirect, Content_type, Source, Status]
    try:
        result_arr = URL_RESULTS[key]
    except KeyError:
        result_arr = [None, None, None, None, "Not Yet Accessed"]
    parents_urls = RELATIONSHIPS[key]
    for parent in parents_urls:
        parent_result_df.loc[location] = [parent, key, *result_arr]
        location += 1
return parent_result_df
I'm looking for a way to make this merge much faster, if at all possible. Writing this report currently takes almost as long as the scrape itself, and I think I'm missing something that could speed up this process.
I'd appreciate any help finding a better way to do this. Sorry for the information overload, and thank you greatly for any help!
CodePudding user response:
You can make two DataFrames from the dictionaries and then use the standard .merge():
import pandas as pd

relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"],
}
url_results = {
    "url1": ["page1", "redirect1", "content_type1"],
    "url2": ["page2", "redirect2", "content_type2"],
}

df1 = pd.DataFrame(
    [(k, p) for k, v in relationships.items() for p in v],
    columns=["url", "parent"],
)
df2 = pd.DataFrame(
    [(k, *v) for k, v in url_results.items()],
    columns=["url", "page", "redirect", "content_type"],
)
print(df1.merge(df2, on="url"))
Prints:
url parent page redirect content_type
0 url1 parent1 page1 redirect1 content_type1
1 url1 parent2 page1 redirect1 content_type1
2 url1 parent3 page1 redirect1 content_type1
3 url2 parent1 page2 redirect2 content_type2
4 url2 parent2 page2 redirect2 content_type2
5 url2 parent3 page2 redirect2 content_type2
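One caveat: an inner merge drops any URL that has no entry in url_results yet. If you want to keep the "Not Yet Accessed" fallback from your original loop, a left merge preserves those rows. This is a sketch with made-up sample data (url2 deliberately has no result):

```python
import pandas as pd

relationships = {"url1": ["parent1", "parent2"], "url2": ["parent1"]}
url_results = {"url1": ["page1", "redirect1", "content_type1"]}  # url2 not scraped yet

df1 = pd.DataFrame(
    [(k, p) for k, v in relationships.items() for p in v],
    columns=["url", "parent"],
)
df2 = pd.DataFrame(
    [(k, *v) for k, v in url_results.items()],
    columns=["url", "page", "redirect", "content_type"],
)

# how="left" keeps every (url, parent) pair even when the url has no result row
merged = df1.merge(df2, on="url", how="left")
merged["status"] = merged["page"].isna().map({True: "Not Yet Accessed", False: "OK"})
```

The unmatched url2 row comes through with NaN in the result columns, which the last line turns into a status flag.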
CodePudding user response:
Given something like:
relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"],
}
url_results = {
    "url1": ["page", "redirect", "content_type"],
    "url2": ["page", "redirect", "content_type"],
}
One could use a comprehension to load this data into pandas and then persist it to Excel, like:
import pandas

print(pandas.DataFrame(
    [parent, rk] + url_results.get(rk, [])
    for rk, rv in relationships.items()
    for parent in rv
))
Giving you a table:
0 1 2 3 4
0 parent1 url1 page redirect content_type
1 parent2 url1 page redirect content_type
2 parent3 url1 page redirect content_type
3 parent1 url2 page redirect content_type
4 parent2 url2 page redirect content_type
5 parent3 url2 page redirect content_type
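If you do want named columns and an Excel file, the same comprehension can feed the DataFrame constructor together with a columns= argument. The to_excel call is left commented here since it needs an Excel engine such as openpyxl installed; the file name is just an example:

```python
import pandas as pd

relationships = {"url1": ["parent1", "parent2"]}
url_results = {"url1": ["page", "redirect", "content_type"]}

df = pd.DataFrame(
    # same comprehension as above, now with proper column labels
    ([parent, rk] + url_results.get(rk, [])
     for rk, rv in relationships.items()
     for parent in rv),
    columns=["parent", "url", "page", "redirect", "content_type"],
)
# df.to_excel("report.xlsx", index=False)  # requires an engine like openpyxl
```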
But if the ultimate objective is just to persist it to a file, I would skip pandas and stream the results to a CSV:
import csv

with open("./out.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["parent", "url", "page", "redirect", "content_type"])
    writer.writerows(
        [parent, rk] + url_results.get(rk, [])
        for rk, rv in relationships.items()
        for parent in rv
    )
Giving you a CSV file like:
parent,url,page,redirect,content_type
parent1,url1,page,redirect,content_type
parent2,url1,page,redirect,content_type
parent3,url1,page,redirect,content_type
parent1,url2,page,redirect,content_type
parent2,url2,page,redirect,content_type
parent3,url2,page,redirect,content_type