I want to try and webscrape my own Stackoverflow Profiles! By this I mean, get an html link of every question I have ever asked:
- https://stackoverflow.com/users/18181916/antonoyaro8
- https://math.stackexchange.com/users/1024449/antonoyaro8
I tried to do this follows:
library(rvest)
library(httr)
library(XML)
url<-"https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest"
page <-read_html(url)
resource <- GET(url)
parse <- htmlParse(resource)
links <- list(xpathSApply(parse, path="//a", xmlGetAttr, "href"))
I tried to pick up on a pattern and noticed that all links with questions have some number - so I tried to write a code that checks if elements in the list contain a number and keep these links:
rv <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
final <- unique (grep(paste(rv,collapse="|"),
links, value=TRUE))
But I don't think I am doing this correctly - apart from the messy formatting, the final file is returning links that do not contain any numbers at all.
Can someone please show me how to webscrape these links properly, and then repeat this for all pages (e.g. https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest, https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest&page=2, https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest&page=3)
Worse come to worst, if I can do it for one of these pages, I can manually copy/paste the code for all pages and proceed that way.
Thank you!
CodePudding user response:
The output is a list
of length 1. We need to extract ([[
) the element before applying the grep
unique (grep(paste(rv,collapse="|"),
links[[1]], value=TRUE))
Note that the rv
includes numbers 0 to 9 and it can match a digit if it is present anywhere in the link. If the intention is to subset the digits following the questions
grep("questions/\\d ", links[[1]], value = TRUE)
-output
[1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"
[2] "/questions/72843570/combing-two-selections-together"
[3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"
[4] "/questions/72840624/even-out-table-in-r"
[5] "/questions/72840548/creating-a-dictionary-reference-table"
[6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"
[7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"
[8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
[9] "/questions/72738885/referencing-a-query-in-another-query"
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
...
If there are multiple pages, add the page=
with paste
or sprintf
urls <- c(url, sprintf("%s&page=%d", url, 2:3))
out_lst <- lapply(urls, function(url)
{
page <-read_html(url)
resource <- GET(url)
parse <- htmlParse(resource)
links <- list(xpathSApply(parse, path="//a", xmlGetAttr, "href"))
grep("questions/\\d ", links[[1]], value = TRUE)
})
-output
> out_lst
[[1]]
[1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"
[2] "/questions/72843570/combing-two-selections-together"
[3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"
[4] "/questions/72840624/even-out-table-in-r"
[5] "/questions/72840548/creating-a-dictionary-reference-table"
[6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"
[7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"
[8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
[9] "/questions/72738885/referencing-a-query-in-another-query"
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
[14] "/questions/72710448/removing-files-from-global-environment-with-a-certain-pattern"
[15] "/questions/72710203/r-sql-is-the-default-option-sampling-with-replacement"
[16] "/questions/72695401/allocating-max-memory-in-r"
[17] "/questions/72681898/randomly-delete-columns-from-datasets"
[18] "/questions/72663516/are-rds-files-more-efficient-than-csv-files"
[19] "/questions/72625690/importing-files-using-list-files"
[20] "/questions/72623856/second-most-common-element-in-each-row"
[21] "/questions/72623744/counting-the-position-where-a-pattern-is-completed"
[22] "/questions/72620501/bulk-import-export-files-from-r"
[23] "/questions/72613413/counting-every-position-where-a-pattern-appears"
[24] "/questions/72612577/counting-the-position-of-the-first-0-in-each-row"
[25] "/questions/72607160/taking-averages-across-lists"
[26] "/questions/72589276/functions-for-finding-out-the-midpoint-interpolation"
[27] "/questions/72587298/sandwiching-values-between-rows"
[28] "/questions/72569338/integration-error-lengthlower-1-is-not-true"
[29] "/questions/72568817/synchronizing-nas-in-r"
[30] "/questions/72568661/finding-the-loser-in-each-row"
[[2]]
[1] "/questions/72566170/making-a-race-between-two-variables"
[2] "/questions/72418723/making-a-list-of-random-numbers"
[3] "/questions/72418364/random-uniform-numbers-without-runif"
[4] "/questions/72353102/integrate-normal-distribution-between-2-values"
[5] "/questions/72174868/placing-commas-between-names"
[6] "/questions/72163297/simulate-flipping-french-fries-in-r"
[7] "/questions/71982286/alternatives-to-the-partition-by-statement-in-sql"
[8] "/questions/71970960/converting-lists-into-data-frames"
[9] "/questions/71970672/random-numbers-are-too-similar-to-each-other"
[10] "/questions/71933753/making-combinations-of-items"
[11] "/questions/71874791/sorting-rows-in-specified-order"
[12] "/questions/71866097/hiding-the-legend-in-this-graph"
[13] "/questions/71866048/understanding-the-median-in-this-graph"
[14] "/questions/71852517/nas-produced-when-number-of-iterations-increase"
[15] "/questions/71791906/assigning-unique-colors-to-multiple-lines-on-a-graph"
[16] "/questions/71787336/finding-identical-rows-in-multiple-datasets"
[17] "/questions/71758983/multiple-replace-lookups"
[18] "/questions/71758648/create-ascending-id-in-a-data-frame"
[19] "/questions/71731208/webscraping-data-which-pokemon-can-learn-which-attacks"
[20] "/questions/71728273/webscraping-pokemon-data"
[21] "/questions/71683045/identifying-smallest-element-in-each-row-of-a-matrix"
[22] "/questions/71671488/connecting-all-nodes-together-on-a-graph"
[23] "/questions/71641774/overriding-colors-in-ggplot2"
[24] "/questions/71641404/applying-a-function-to-a-data-frame-lapply-vs-traditional-way"
[25] "/questions/71624111/sending-emails-from-r"
[26] "/questions/71623019/sql-joining-tables-from-2-different-servers-r-vs-sas"
[27] "/questions/71429265/overriding-sql-errors-during-r-uploads"
[28] "/questions/71429129/splitting-a-dataset-into-uneven-portions"
[29] "/questions/71418533/multiplying-and-adding-values-across-rows"
[30] "/questions/71417489/tricking-an-sql-server-to-accept-a-file-from-r"
[[3]]
[1] "/questions/71417218/splitting-a-dataset-into-arbitrary-sections"
[2] "/questions/71398804/plotting-vector-fields-and-gradient-fields"
[3] "/questions/71387596/animating-the-mandelbrot-set"
[4] "/questions/71358405/repeat-a-set-of-ids-for-every-n-rows"
[5] "/questions/71344822/time-series-graphs-with-different-symbols"
[6] "/questions/71341865/creating-a-data-frame-with-commas"
[7] "/questions/71287944/converting-igraph-to-visnetwork"
[8] "/questions/71282863/fixing-the-first-and-last-numbers-in-a-random-list"
[9] "/questions/71282403/adding-labels-to-graph-nodes"
[10] "/questions/71262761/understanding-list-and-do-call-commands"
[11] "/questions/71261431/adjusting-graph-layouts"
[12] "/questions/71255038/overriding-non-existent-components-in-a-loop"
[13] "/questions/71244872/fixing-cluttered-titles-on-graphs"
[14] "/questions/71243676/directly-adding-titles-and-labels-to-visnetwork"
[15] "/questions/71232353/removing-all-edges-in-igraph"
[16] "/questions/71230273/writing-a-function-that-references-elements-in-a-matrix"
[17] "/questions/71227260/generating-random-graphs-according-to-some-conditions"
[18] "/questions/71087349/adding-combinations-of-numbers-in-a-matrix"