Trying to scrape all the names from this website with Python:
https://profile.tmb.state.tx.us/Search.aspx?9e94dec6-c7e7-4054-b5fb-20a1fcdbab53
The issue is that it limits each search to the top 50 results.
Since the last name search allows wildcards, I tried using one search result to narrow down subsequent search results (using prefixes). However, this approach becomes difficult when more than 50 people have the same last name.
Any other ideas on how to get every possible name from this website? Thank you!!
CodePudding user response:
Looking at the request and JS, it seems like this limit is server-side. I don't see any way to retrieve more than 50 results.
I think brute force is the only way to scrape this site, and it isn't trivial. You would need to generate more and more specific queries until the response has fewer than 50 results.
For each single-character prefix, starting with a for example's sake, you could search a*. If there are fewer than 50 results, scrape them and move on to the next prefix. Otherwise you'll need to search every two-character prefix beginning with a: aa*, ab*, ac*, etc.
I'm sure there's some term for this, but I don't know it!
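Here is a rough sketch of that prefix refinement in Python. The `search` callable is hypothetical: it stands in for whatever code submits the wildcard to the site and parses the names out of the response, which isn't shown here. `fake_search` and `SAMPLE` are made-up stand-ins so the sketch runs locally. It also assumes last names contain only lowercase ASCII letters, so hyphens, apostrophes and spaces would need to be added to `alphabet`.

```python
import string


def crawl_prefixes(search, alphabet=string.ascii_lowercase, limit=50, max_len=6):
    """Collect distinct names by refining prefixes until results are not capped.

    `search(pattern)` is a hypothetical callable returning the (capped) list
    of names that match a wildcard pattern such as "ab*".
    """
    found = set()
    pending = list(alphabet)  # start with every length-one prefix
    while pending:
        prefix = pending.pop()
        hits = search(prefix + "*")
        found.update(hits)  # keep whatever came back, capped or not
        # A full page was probably truncated, so go one character deeper.
        if len(hits) >= limit and len(prefix) < max_len:
            pending.extend(prefix + ch for ch in alphabet)
    return found


# Stand-in for the real site: a local list with a cap of 3 instead of 50.
SAMPLE = ["aaron", "abbot", "abel", "adams", "allen", "baker", "bell"]


def fake_search(pattern, limit=3):
    prefix = pattern.rstrip("*")
    return sorted(n for n in SAMPLE if n.startswith(prefix))[:limit]


names = crawl_prefixes(fake_search, limit=3)
print(sorted(names))
```

The `max_len` cutoff is only there to stop the loop when a prefix can never get below the cap. That is the case the question mentions: if more than 50 people share one last name, narrowing the last name alone won't separate them, and you would have to narrow on some other search field as well.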
CodePudding user response:
I think it would work better to decrement characters, for example AAB -> AAA. That trivial solution will find every name, but it will take a lot of time. To optimise it, you can use a headless browser.