Most efficient way to merge two large Python dictionaries, Results and Relationships, to be written to an Excel report

Time:02-12

I've been working on a webscraper which runs asynchronously to substantially improve speed. Because of the way the scraper works, it writes two pieces of information for each url it scrapes. These are stored in two dictionaries:

# {url1: [parent1, parent2, parent3], url2: [parent1, parent2, parent3]...}
relationships = dict()
# {url1: [page, redirect, content_type...], url2: [page, redirect, content_type...]}
url_results = dict()

The first dictionary, relationships, tracks each occurrence of the key url within a parent url. The second dictionary stores the resulting information I need from each url scraped.

This is all necessary for how I'm handling async threading and memoization, and it works really nicely. The problem lies in writing these results to an Excel report. These dictionaries are growing to thousands of keys in size and merging them together is very slow.

The end result is that for each parent value of a url in the relationships dict, a line is written in my Excel sheet:

parent url page redirect content type etc
parent1 url1 page redirect content_type etc...
parent1 url2 page redirect content_type etc...
parent1 url3 page redirect content_type etc...
parent2 url1 page redirect content_type etc...
parent2 url2 page redirect content_type etc...
parent2 url3 page redirect content_type etc...

Here's the method I'm currently using to merge these dictionaries together like this:

parent_result_df = pd.DataFrame(columns=["Parent", "URL", "Page", "Redirect", "Content-type", "Source", "Status"])
location = 1
for key in RELATIONSHIPS:
    # result_arr = [Item, Redirect, Content_type, Source, Status]
    try:
        result_arr = URL_RESULTS[key]
    except KeyError:
        result_arr = [None, None, None, None, "Not Yet Accessed"]
    parents_urls = RELATIONSHIPS[key]
    for parent in parents_urls:
        parent_result_df.loc[location] = [parent, key, *result_arr]
        location += 1
return parent_result_df

I'm looking for a way to make this merge much faster if at all possible. It's currently taking quite a lot of time to write this report, almost as much as the scrape itself, and I think I'm missing something that could help speed up this process.

I'd appreciate any help finding a better way to do this. Sorry for the information overkill, and thank you greatly for any help!

CodePudding user response:

You can build two DataFrames from the dictionaries and then use the standard .merge():

import pandas as pd

relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"],
}
url_results = {
    "url1": ["page1", "redirect1", "content_type1"],
    "url2": ["page2", "redirect2", "content_type2"],
}

df1 = pd.DataFrame(
    [(k, p) for k, v in relationships.items() for p in v],
    columns=["url", "parent"],
)
df2 = pd.DataFrame(
    [(k, *v) for k, v in url_results.items()],
    columns=["url", "page", "redirect", "content_type"],
)

print(df1.merge(df2, on="url"))

Prints:

    url   parent   page   redirect   content_type
0  url1  parent1  page1  redirect1  content_type1
1  url1  parent2  page1  redirect1  content_type1
2  url1  parent3  page1  redirect1  content_type1
3  url2  parent1  page2  redirect2  content_type2
4  url2  parent2  page2  redirect2  content_type2
5  url2  parent3  page2  redirect2  content_type2
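Note that an inner merge drops any url that appears in relationships but has no entry yet in url_results, whereas the original loop filled those rows with "Not Yet Accessed". A left merge preserves them; here is a sketch building on the example above (the url3 entry and the placeholder-filling step are my additions, mirroring the asker's fallback):

```python
import pandas as pd

relationships = {
    "url1": ["parent1", "parent2"],
    "url2": ["parent1"],
    "url3": ["parent2"],  # known relationship, but not yet scraped
}
url_results = {
    "url1": ["page1", "redirect1", "content_type1"],
    "url2": ["page2", "redirect2", "content_type2"],
}

df1 = pd.DataFrame(
    [(k, p) for k, v in relationships.items() for p in v],
    columns=["url", "parent"],
)
df2 = pd.DataFrame(
    [(k, *v) for k, v in url_results.items()],
    columns=["url", "page", "redirect", "content_type"],
)

# how="left" keeps url3 even though it is missing from url_results;
# its result columns come back as NaN, which we relabel.
merged = df1.merge(df2, on="url", how="left")
result_cols = ["page", "redirect", "content_type"]
merged[result_cols] = merged[result_cols].fillna("Not Yet Accessed")
print(merged)
```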

CodePudding user response:

Given something like:

relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"]
}

url_results = {
    "url1": ["page", "redirect", "content_type"],
    "url2": ["page", "redirect", "content_type"]
}

One could use a comprehension to load this data into pandas and then persist it to Excel:

import pandas
print(pandas.DataFrame(
    [parent, rk] + url_results.get(rk, [])
    for rk, rv in relationships.items()
    for parent in rv
))

Giving you a table:

         0     1     2         3             4
0  parent1  url1  page  redirect  content_type
1  parent2  url1  page  redirect  content_type
2  parent3  url1  page  redirect  content_type
3  parent1  url2  page  redirect  content_type
4  parent2  url2  page  redirect  content_type
5  parent3  url2  page  redirect  content_type
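To match the original report, the same comprehension can be given explicit column names, after which persisting to Excel is a single call. A sketch (the file name is hypothetical, and to_excel needs an engine such as openpyxl installed):

```python
import pandas as pd

relationships = {
    "url1": ["parent1", "parent2", "parent3"],
    "url2": ["parent1", "parent2", "parent3"],
}
url_results = {
    "url1": ["page", "redirect", "content_type"],
    "url2": ["page", "redirect", "content_type"],
}

df = pd.DataFrame(
    [
        [parent, rk] + url_results.get(rk, [])
        for rk, rv in relationships.items()
        for parent in rv
    ],
    columns=["parent", "url", "page", "redirect", "content_type"],
)
# One call to persist (commented out; requires openpyxl):
# df.to_excel("report.xlsx", index=False)
print(df)
```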

But if the ultimate objective was to persist it to a file, I would skip pandas and stream the results to a CSV:

import csv
with open("./out.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["parent", "url", "page", "redirect", "content_type"])
    writer.writerows(
        [parent, rk] + url_results.get(rk, [])
        for rk, rv in relationships.items()
        for parent in rv
    )

Giving you a csv file like:

parent,url,page,redirect,content_type
parent1,url1,page,redirect,content_type
parent2,url1,page,redirect,content_type
parent3,url1,page,redirect,content_type
parent1,url2,page,redirect,content_type
parent2,url2,page,redirect,content_type
parent3,url2,page,redirect,content_type
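One caveat with `.get(rk, [])`: a url with no entry in url_results produces a short row, so the CSV columns no longer line up. Supplying a placeholder default of the right width keeps every row rectangular; a sketch, where the placeholder values mirror the asker's "Not Yet Accessed" fallback:

```python
import csv

relationships = {
    "url1": ["parent1", "parent2"],
    "url2": ["parent1"],
}
url_results = {
    "url1": ["page", "redirect", "content_type"],
    # url2 has not been scraped yet
}

# Default row with the same width as a real result, so the CSV stays rectangular.
missing = [None, None, "Not Yet Accessed"]

with open("out.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["parent", "url", "page", "redirect", "content_type"])
    writer.writerows(
        [parent, rk] + url_results.get(rk, missing)
        for rk, rv in relationships.items()
        for parent in rv
    )
```

csv.writer renders None as an empty field, so the url2 row comes out as `parent1,url2,,,Not Yet Accessed`.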