Request to Github search without pagination

Time: 10-23

I'm trying to get, in a GitHub organization, a list of the repositories that depend on specific ones.

For example, if I have a library called string_utils, I want to find only the names of the repositories (not the files, not the content, just the repository names) that contain something like import string_utils.

When I search using the GitHub browser interface, I get all the files in the repositories and each matching line with its content. I only want the repository names.

I ended up copying the request as a curl command and doing some shell scripting:

query_string="q=<query search> -repo:<repo>"
url="https://<url stuff>/search"

grep_regex='<a.*?href="/[^\/]+/\K[^\"]+(?=">)'

# mapfile reads one repository name per line into the array
mapfile -t dependents < <(
  curl -G \
    --data-urlencode "${query_string}" \
    --data-urlencode "type=Code" \
    -H "${COOKIE_HEADER}" \
    --silent \
    "$url" \
  | grep -Po "$grep_regex" \
  | awk '!unique[$0]++'
)

But I get fewer repositories than expected. I think that is because of pagination.

Does anyone know how to get all the results without pagination, or a better approach?

CodePudding user response:

You should use the GitHub REST API for this; it has a dedicated code search endpoint (GET /search/code). The API is designed for machine-readable interactions, whereas a web-scraping technique like yours can break at any time and may well be blocked as an anti-abuse measure.
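For example, here is a minimal sketch of that approach. It assumes a GitHub Enterprise host (the /api/v3/ prefix is the usual Enterprise path; adjust for your installation), a personal access token in GITHUB_TOKEN, and jq available for JSON parsing; <query search>, <repo> and <url stuff> are the placeholders from your question:

```shell
# Query the code search endpoint instead of scraping the HTML results page,
# then keep only the repository names, deduplicated.
curl --silent -G \
  -H "Authorization: token ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  --data-urlencode 'q=<query search> -repo:<repo>' \
  "https://<url stuff>/api/v3/search/code" \
| jq -r '.items[].repository.full_name' \
| sort -u
```

Each item in the JSON response carries a repository object, so the repository name comes back directly instead of being scraped out of anchor tags.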

However, do note that the REST API responses are still paginated. That's because GitHub doesn't know how many responses you'll want, and there may be many responses to your request (for search, possibly millions). If you only want the first thousand, then it would be very wasteful to generate the remainder of the responses, so GitHub requires you to request no more than one hundred at a time. This is a standard measure on REST APIs to provide good performance and avoid DoS attacks.
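Within those limits you can still walk the pages yourself. A rough sketch, under the same assumptions as above (Enterprise-style URL, GITHUB_TOKEN, jq), that stops when a page comes back empty; note that the Search API only ever exposes the first 1,000 results of a query, so at per_page=100 there are at most 10 pages:

```shell
# Fetch search results page by page (100 items each) until a page is empty,
# then deduplicate the collected repository names.
page=1
while : ; do
  names=$(curl --silent -G \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    -H "Accept: application/vnd.github+json" \
    --data-urlencode 'q=<query search> -repo:<repo>' \
    --data-urlencode "per_page=100" \
    --data-urlencode "page=${page}" \
    "https://<url stuff>/api/v3/search/code" \
    | jq -r '.items[].repository.full_name')
  [ -z "$names" ] && break        # empty page: no more results
  printf '%s\n' "$names"
  page=$((page + 1))
done | sort -u
```

If your query really matches more than 1,000 files, narrow it (for example per organization or per language qualifier) and run it several times.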
