I'm a web scraping newbie trying to efficiently scrape data from signal.nfx.com. The issue I have is that I keep scraping the same data over and over, which makes my scraper inefficient. I want to scrape all investors on a page, but I end up scraping just a few per page repeatedly. How can I resolve this? Check the code below:
import requests
import pandas as pd

url = "https://signal-api.nfx.com/graphql"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
payload = {"operationName": "vclInvestors",
           "variables": {"slug": "gig-economy-pre-seed",
                         "order": [{}],
                         "after": "OA"},
           "query": "query vclInvestors($slug: String!, $after: String) {\n list(slug: $slug) {\n id\n slug\n investor_count\n vertical {\n id\n display_name\n kind\n __typename\n }\n location {\n id\n display_name\n __typename\n }\n stage\n firms {\n id\n name\n slug\n __typename\n }\n scored_investors(first: 8, after: $after) {\n pageInfo {\n hasNextPage\n hasPreviousPage\n endCursor\n __typename\n }\n record_count\n edges {\n node {\n ...investorListInvestorProfileFields\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n }\n}\n\nfragment investorListInvestorProfileFields on InvestorProfile {\n id\n person {\n id\n first_name\n last_name\n name\n slug\n linkedin_url\n twitter_url\n is_me\n is_on_target_list\n __typename\n }\n image_urls\n position\n min_investment\n max_investment\n target_investment\n areas_of_interest_freeform\n is_preferred_coinvestor\n firm {\n id\n current_fund_size\n name\n slug\n __typename\n }\n investment_locations {\n id\n display_name\n location_investor_list {\n stage_name\n id\n slug\n __typename\n }\n __typename\n }\n investor_lists {\n id\n stage_name\n slug\n vertical {\n kind\n id\n display_name\n __typename\n }\n __typename\n }\n __typename\n}\n"}

results = pd.DataFrame()
hasNextPage = True
after = ''

while hasNextPage == True:
    payload['variables']['after'] == after
    jsonData = requests.post(url, headers=headers, json=payload).json()
    data = jsonData['data']['list']['scored_investors']['edges']
    df = pd.json_normalize(data)
    results = results.append(df, sort=False).reset_index(drop=True)
    count = len(results)
    tot = jsonData['data']['list']['investor_count']
    print(f'{count} of {tot}')
    hasNextPage = jsonData['data']['list']['scored_investors']['pageInfo']['hasNextPage']
    after = jsonData['data']['list']['scored_investors']['pageInfo']['endCursor']
I was able to scrape over 50,000 rows, but almost all of them were duplicates. See below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55448 entries, 0 to 55447
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Unnamed: 0                       55448 non-null  int64
 1   __typename                       55448 non-null  object
 2   node.__typename                  55448 non-null  object
 3   node.id                          55448 non-null  int64
 4   node.person.id                   55448 non-null  int64
 5   node.person.first_name           55448 non-null  object
 6   node.person.last_name            55448 non-null  object
 7   node.person.name                 55448 non-null  object
 8   node.person.slug                 55448 non-null  object
 9   node.person.linkedin_url         55448 non-null  object
 10  node.person.twitter_url          20793 non-null  object
 11  node.person.is_me                55448 non-null  bool
 12  node.person.is_on_target_list    55448 non-null  bool
 13  node.person.__typename           55448 non-null  object
 14  node.image_urls                  55448 non-null  object
 15  node.position                    55448 non-null  object
 16  node.min_investment              55448 non-null  int64
 17  node.max_investment              55448 non-null  int64
 18  node.target_investment           55448 non-null  int64
 19  node.areas_of_interest_freeform  20793 non-null  object
 20  node.is_preferred_coinvestor     55448 non-null  bool
 21  node.firm.id                     55448 non-null  int64
 22  node.firm.current_fund_size      0 non-null      float64
 23  node.firm.name                   55448 non-null  object
 24  node.firm.slug                   55448 non-null  object
 25  node.firm.__typename             55448 non-null  object
 26  node.investment_locations        55448 non-null  object
 27  node.investor_lists              55448 non-null  object
dtypes: bool(3), float64(1), int64(7), object(17)
memory usage: 10.7 MB
After removing duplicates and unnecessary columns (a sketch of this clean-up follows the output below):
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 10 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   node.person.name                 8 non-null      object
 1   node.person.linkedin_url         8 non-null      object
 2   node.person.twitter_url          3 non-null      object
 3   node.position                    8 non-null      object
 4   node.min_investment              8 non-null      int64
 5   node.max_investment              8 non-null      int64
 6   node.target_investment           8 non-null      int64
 7   node.areas_of_interest_freeform  3 non-null      object
 8   node.firm.current_fund_size      0 non-null      float64
 9   node.firm.name                   8 non-null      object
dtypes: float64(1), int64(3), object(6)
memory usage: 704.0 bytes
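Roughly, the clean-up looked like this (a sketch, not the exact code; deduplicating on node.id and the column list are illustrative):

# Sketch of the clean-up: deduplicate on the investor id, then keep
# only the columns of interest (list matches the output above).
keep = ['node.person.name', 'node.person.linkedin_url', 'node.person.twitter_url',
        'node.position', 'node.min_investment', 'node.max_investment',
        'node.target_investment', 'node.areas_of_interest_freeform',
        'node.firm.current_fund_size', 'node.firm.name']
results = results.drop_duplicates(subset='node.id')[keep].reset_index(drop=True)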
CodePudding user response:
You have a typo assigning your after parameter:

payload['variables']['after'] == after
#                             ^^ should be a single =
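With a single =, the cursor actually advances and each request fetches the next page. A minimal sketch of the corrected loop (using pd.concat, since DataFrame.append is deprecated in recent pandas; the rest is your code unchanged):

results = pd.DataFrame()
hasNextPage = True
after = ''
while hasNextPage:
    payload['variables']['after'] = after   # single '=' assigns the new cursor
    jsonData = requests.post(url, headers=headers, json=payload).json()
    investors = jsonData['data']['list']['scored_investors']
    df = pd.json_normalize(investors['edges'])
    # pd.concat replaces the deprecated DataFrame.append
    results = pd.concat([results, df], ignore_index=True, sort=False)
    print(f"{len(results)} of {jsonData['data']['list']['investor_count']}")
    hasNextPage = investors['pageInfo']['hasNextPage']
    after = investors['pageInfo']['endCursor']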
In general, when scraping in a while loop you should confirm that all request parameters are actually being updated before sending each request, or you end up just spamming the website with identical calls.
One easy way to catch this is to check that the hash of each new response hasn't been seen before.
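For example, a minimal sketch of that guard (hashing the raw response body with hashlib; the surrounding loop is the same as above):

import hashlib

seen = set()
hasNextPage = True
while hasNextPage:
    payload['variables']['after'] = after
    response = requests.post(url, headers=headers, json=payload)
    # Hash the raw body; seeing the same digest twice means the request
    # parameters never changed and we are re-fetching the same page.
    digest = hashlib.sha256(response.content).hexdigest()
    if digest in seen:
        raise RuntimeError('duplicate response - pagination cursor is not advancing')
    seen.add(digest)
    jsonData = response.json()
    # ... parse edges into the results DataFrame as before ...
    pageInfo = jsonData['data']['list']['scored_investors']['pageInfo']
    hasNextPage = pageInfo['hasNextPage']
    after = pageInfo['endCursor']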