Python web scraping: How do I avoid scraping duplicates from signal.nfx.com


I'm a web scraping newbie trying to efficiently scrape data from signal.nfx.com. The issue I have is that I keep scraping the same data over and over, which makes my scraper inefficient. I want to scrape all investors on a page, but I am scraping just a few per page repeatedly. How can I resolve this? Check the code below:

import requests
import pandas as pd

url = "https://signal-api.nfx.com/graphql"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
payload = {"operationName":"vclInvestors",
           "variables":{"slug":"gig-economy-pre-seed",
                        "order":[{}],
                        "after":"OA"},
           "query":"query vclInvestors($slug: String!, $after: String) {\n  list(slug: $slug) {\n    id\n    slug\n    investor_count\n    vertical {\n      id\n      display_name\n      kind\n      __typename\n    }\n    location {\n      id\n      display_name\n      __typename\n    }\n    stage\n    firms {\n      id\n      name\n      slug\n      __typename\n    }\n    scored_investors(first: 8, after: $after) {\n      pageInfo {\n        hasNextPage\n        hasPreviousPage\n        endCursor\n        __typename\n      }\n      record_count\n      edges {\n        node {\n          ...investorListInvestorProfileFields\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment investorListInvestorProfileFields on InvestorProfile {\n  id\n  person {\n    id\n    first_name\n    last_name\n    name\n    slug\n  linkedin_url\n  twitter_url\n  is_me\n    is_on_target_list\n   __typename\n  }\n  image_urls\n  position\n  min_investment\n  max_investment\n  target_investment\n  areas_of_interest_freeform\n is_preferred_coinvestor\n  firm {\n    id\n  current_fund_size\n  name\n    slug\n    __typename\n  }\n  investment_locations {\n    id\n    display_name\n    location_investor_list {\n   stage_name\n   id\n      slug\n      __typename\n    }\n    __typename\n  }\n  investor_lists {\n    id\n    stage_name\n    slug\n    vertical {\n   kind\n   id\n      display_name\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n"}


results = pd.DataFrame()
hasNextPage = True
after = ''

while hasNextPage == True:
    payload['variables']['after'] == after
    jsonData = requests.post(url, headers=headers, json=payload).json()
    data = jsonData['data']['list']['scored_investors']['edges']
    df = pd.json_normalize(data)
    results = results.append(df, sort=False).reset_index(drop=True)
    
    count = len(results) 
    tot = jsonData['data']['list']['investor_count']
    
    print(f'{count} of {tot}')
    
    hasNextPage = jsonData['data']['list']['scored_investors']['pageInfo']['hasNextPage']
    after = jsonData['data']['list']['scored_investors']['pageInfo']['endCursor']

I was able to scrape over 50,000 rows, but almost all of them were duplicates. See below:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55448 entries, 0 to 55447
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Unnamed: 0                       55448 non-null  int64  
 1   __typename                       55448 non-null  object 
 2   node.__typename                  55448 non-null  object 
 3   node.id                          55448 non-null  int64  
 4   node.person.id                   55448 non-null  int64  
 5   node.person.first_name           55448 non-null  object 
 6   node.person.last_name            55448 non-null  object 
 7   node.person.name                 55448 non-null  object 
 8   node.person.slug                 55448 non-null  object 
 9   node.person.linkedin_url         55448 non-null  object 
 10  node.person.twitter_url          20793 non-null  object 
 11  node.person.is_me                55448 non-null  bool   
 12  node.person.is_on_target_list    55448 non-null  bool   
 13  node.person.__typename           55448 non-null  object 
 14  node.image_urls                  55448 non-null  object 
 15  node.position                    55448 non-null  object 
 16  node.min_investment              55448 non-null  int64  
 17  node.max_investment              55448 non-null  int64  
 18  node.target_investment           55448 non-null  int64  
 19  node.areas_of_interest_freeform  20793 non-null  object 
 20  node.is_preferred_coinvestor     55448 non-null  bool   
 21  node.firm.id                     55448 non-null  int64  
 22  node.firm.current_fund_size      0 non-null      float64
 23  node.firm.name                   55448 non-null  object 
 24  node.firm.slug                   55448 non-null  object 
 25  node.firm.__typename             55448 non-null  object 
 26  node.investment_locations        55448 non-null  object 
 27  node.investor_lists              55448 non-null  object 
dtypes: bool(3), float64(1), int64(7), object(17)
memory usage: 10.7 MB

After removing duplicates and unnecessary columns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 10 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   node.person.name                 8 non-null      object 
 1   node.person.linkedin_url         8 non-null      object 
 2   node.person.twitter_url          3 non-null      object 
 3   node.position                    8 non-null      object 
 4   node.min_investment              8 non-null      int64  
 5   node.max_investment              8 non-null      int64  
 6   node.target_investment           8 non-null      int64  
 7   node.areas_of_interest_freeform  3 non-null      object 
 8   node.firm.current_fund_size      0 non-null      float64
 9   node.firm.name                   8 non-null      object 
dtypes: float64(1), int64(3), object(6)
memory usage: 704.0 bytes

CodePudding user response:

You have a typo assigning your after parameter:

payload['variables']['after'] == after
#                             ^^ should be just a single =
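With a single =, the cursor actually advances between requests. Here is a minimal sketch of the corrected loop, reusing the url, headers and payload from the question (pages are collected in a list and combined with pd.concat, since DataFrame.append was removed in pandas 2.0):

import requests
import pandas as pd

frames = []
hasNextPage = True
after = ''

while hasNextPage:
    payload['variables']['after'] = after  # single =, so the cursor is updated
    jsonData = requests.post(url, headers=headers, json=payload).json()

    investors = jsonData['data']['list']['scored_investors']
    frames.append(pd.json_normalize(investors['edges']))

    hasNextPage = investors['pageInfo']['hasNextPage']
    after = investors['pageInfo']['endCursor']

results = pd.concat(frames, sort=False).reset_index(drop=True)
print(f"{len(results)} of {jsonData['data']['list']['investor_count']}")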

In general, when scraping in a while loop you should be careful to confirm that all parameters were set correctly before sending out your requests, or you end up just spamming the website.
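For example (purely illustrative, with an arbitrary page cap), you can fail fast whenever the cursor you are about to send is the same one you sent last time, and put a hard ceiling on the number of requests:

MAX_PAGES = 1000        # arbitrary safety cap so a cursor bug can't loop forever
prev_cursor = object()  # sentinel that never equals a real cursor value
after = ''

for _ in range(MAX_PAGES):
    # If the cursor didn't change since the last request, something is wrong.
    assert after != prev_cursor, 'cursor did not advance'
    prev_cursor = after
    payload['variables']['after'] = after

    jsonData = requests.post(url, headers=headers, json=payload).json()
    pageInfo = jsonData['data']['list']['scored_investors']['pageInfo']
    if not pageInfo['hasNextPage']:
        break
    after = pageInfo['endCursor']

With the original typo, the server keeps returning the same page and the same endCursor, so this guard stops the loop within a couple of requests instead of piling up 50,000 duplicate rows.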

One easy way to prevent this is to confirm that the hash of a new response hasn't been seen before.
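As a sketch of that idea (again reusing the request variables from the question), hash each raw response body and stop as soon as one repeats:

import hashlib

seen_hashes = set()
hasNextPage = True
after = ''

while hasNextPage:
    payload['variables']['after'] = after
    response = requests.post(url, headers=headers, json=payload)

    # A byte-identical body means the cursor never moved between requests.
    digest = hashlib.sha256(response.content).hexdigest()
    if digest in seen_hashes:
        print('Duplicate response detected, stopping to avoid spamming the API.')
        break
    seen_hashes.add(digest)

    jsonData = response.json()
    pageInfo = jsonData['data']['list']['scored_investors']['pageInfo']
    hasNextPage = pageInfo['hasNextPage']
    after = pageInfo['endCursor']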
