Recognizing and Keeping Elements Containing Certain Patterns in a List

Time:07-05

I want to web-scrape my own Stack Overflow profile. By this I mean: get the link to every question I have ever asked.

I tried to do this as follows:

library(rvest)
library(httr)
library(XML)

url<-"https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest"
page <-read_html(url)

resource <- GET(url)
parse <- htmlParse(resource)
links <- list(xpathSApply(parse, path="//a", xmlGetAttr, "href"))

I tried to pick up on a pattern and noticed that all links to questions contain some number, so I tried to write code that checks whether each element in the list contains a digit and keeps only those links:

rv <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")

final <- unique (grep(paste(rv,collapse="|"), 
                      links, value=TRUE))

But I don't think I am doing this correctly: apart from the messy formatting, the final result contains links that do not have any numbers in them at all.

Thank you!

Answer:

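links is a list of length 1, because the xpathSApply result was wrapped in list(). grep then treats the whole list as a single element, so we need to extract that element with [[ before applying grep: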

unique(grep(paste(rv, collapse = "|"),
            links[[1]], value = TRUE))
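
To see why the extraction matters, here is a minimal base R sketch. The hrefs vector is made up for illustration and is not scraped from the page; only the list() wrapping mirrors the code in the question.

```r
# Hypothetical hrefs standing in for the scraped result
hrefs <- c("/users/login", "/questions/123/some-title", "/help")

# Wrapping the vector in list() gives a list of length 1
links <- list(hrefs)
length(links)        # 1
length(links[[1]])   # 3

# grep() on the list sees a single element, so it cannot filter link by link
wrapped <- grep("[0-9]", links, value = TRUE)

# grep() on the extracted character vector checks each link separately
kept <- grep("[0-9]", links[[1]], value = TRUE)
```

Here wrapped has a single element that still contains every link, while kept holds only the link that actually has a digit in it.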

Note that rv contains the digits 0 to 9, so the pattern matches a digit anywhere in the link. If the intention is to keep only the links where digits directly follow questions/, use:

grep("questions/\\d+", links[[1]], value = TRUE)

-output

 [1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"    
 [2] "/questions/72843570/combing-two-selections-together"                                           
 [3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"                               
 [4] "/questions/72840624/even-out-table-in-r"                                                       
 [5] "/questions/72840548/creating-a-dictionary-reference-table"                                     
 [6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"             
 [7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"                                
 [8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
 [9] "/questions/72738885/referencing-a-query-in-another-query"                                      
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"                                
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"                                       
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"                     
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
...
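
The difference between the two patterns can be checked on a small vector. This is only a sketch: the hrefs below are illustrative stand-ins (the badge and tag paths are assumptions, not taken from the scraped page).

```r
# Hypothetical hrefs; only the second one is a question link
hrefs <- c(
  "/users/18181916/antonoyaro8",
  "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list",
  "/help/badges/23",
  "/questions/tagged/r"
)

# A digit anywhere in the link: also keeps the profile and badge links
any_digit <- grep("[0-9]", hrefs, value = TRUE)

# Digits directly after "questions/": keeps only the question link
question_links <- grep("questions/\\d+", hrefs, value = TRUE)
```

Anchoring the digits to the questions/ prefix is what excludes profile, badge, and tag links that happen to contain numbers or the word questions.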

If there are multiple pages, append the page= query parameter to the URL with paste or sprintf:

urls <- c(url, sprintf("%s&page=%d", url, 2:3))
out_lst <- lapply(urls, function(u) {
  # Download and parse one page of the question list
  resource <- GET(u)
  parse <- htmlParse(resource)
  # Collect every href on the page, then keep only the question links
  links <- xpathSApply(parse, path = "//a", xmlGetAttr, "href")
  grep("questions/\\d+", links, value = TRUE)
})

-output

> out_lst
[[1]]
 [1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"    
 [2] "/questions/72843570/combing-two-selections-together"                                           
 [3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"                               
 [4] "/questions/72840624/even-out-table-in-r"                                                       
 [5] "/questions/72840548/creating-a-dictionary-reference-table"                                     
 [6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"             
 [7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"                                
 [8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
 [9] "/questions/72738885/referencing-a-query-in-another-query"                                      
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"                                
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"                                       
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"                      
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"                                 
[14] "/questions/72710448/removing-files-from-global-environment-with-a-certain-pattern"             
[15] "/questions/72710203/r-sql-is-the-default-option-sampling-with-replacement"                     
[16] "/questions/72695401/allocating-max-memory-in-r"                                                
[17] "/questions/72681898/randomly-delete-columns-from-datasets"                                     
[18] "/questions/72663516/are-rds-files-more-efficient-than-csv-files"                               
[19] "/questions/72625690/importing-files-using-list-files"                                          
[20] "/questions/72623856/second-most-common-element-in-each-row"                                    
[21] "/questions/72623744/counting-the-position-where-a-pattern-is-completed"                        
[22] "/questions/72620501/bulk-import-export-files-from-r"                                           
[23] "/questions/72613413/counting-every-position-where-a-pattern-appears"                           
[24] "/questions/72612577/counting-the-position-of-the-first-0-in-each-row"                          
[25] "/questions/72607160/taking-averages-across-lists"                                              
[26] "/questions/72589276/functions-for-finding-out-the-midpoint-interpolation"                      
[27] "/questions/72587298/sandwiching-values-between-rows"                                           
[28] "/questions/72569338/integration-error-lengthlower-1-is-not-true"                               
[29] "/questions/72568817/synchronizing-nas-in-r"                                                    
[30] "/questions/72568661/finding-the-loser-in-each-row"                                             

[[2]]
 [1] "/questions/72566170/making-a-race-between-two-variables"                          
 [2] "/questions/72418723/making-a-list-of-random-numbers"                              
 [3] "/questions/72418364/random-uniform-numbers-without-runif"                         
 [4] "/questions/72353102/integrate-normal-distribution-between-2-values"               
 [5] "/questions/72174868/placing-commas-between-names"                                 
 [6] "/questions/72163297/simulate-flipping-french-fries-in-r"                          
 [7] "/questions/71982286/alternatives-to-the-partition-by-statement-in-sql"            
 [8] "/questions/71970960/converting-lists-into-data-frames"                            
 [9] "/questions/71970672/random-numbers-are-too-similar-to-each-other"                 
[10] "/questions/71933753/making-combinations-of-items"                                 
[11] "/questions/71874791/sorting-rows-in-specified-order"                              
[12] "/questions/71866097/hiding-the-legend-in-this-graph"                              
[13] "/questions/71866048/understanding-the-median-in-this-graph"                       
[14] "/questions/71852517/nas-produced-when-number-of-iterations-increase"              
[15] "/questions/71791906/assigning-unique-colors-to-multiple-lines-on-a-graph"         
[16] "/questions/71787336/finding-identical-rows-in-multiple-datasets"                  
[17] "/questions/71758983/multiple-replace-lookups"                                     
[18] "/questions/71758648/create-ascending-id-in-a-data-frame"                          
[19] "/questions/71731208/webscraping-data-which-pokemon-can-learn-which-attacks"       
[20] "/questions/71728273/webscraping-pokemon-data"                                     
[21] "/questions/71683045/identifying-smallest-element-in-each-row-of-a-matrix"         
[22] "/questions/71671488/connecting-all-nodes-together-on-a-graph"                     
[23] "/questions/71641774/overriding-colors-in-ggplot2"                                 
[24] "/questions/71641404/applying-a-function-to-a-data-frame-lapply-vs-traditional-way"
[25] "/questions/71624111/sending-emails-from-r"                                        
[26] "/questions/71623019/sql-joining-tables-from-2-different-servers-r-vs-sas"         
[27] "/questions/71429265/overriding-sql-errors-during-r-uploads"                       
[28] "/questions/71429129/splitting-a-dataset-into-uneven-portions"                     
[29] "/questions/71418533/multiplying-and-adding-values-across-rows"                    
[30] "/questions/71417489/tricking-an-sql-server-to-accept-a-file-from-r"               

[[3]]
 [1] "/questions/71417218/splitting-a-dataset-into-arbitrary-sections"            
 [2] "/questions/71398804/plotting-vector-fields-and-gradient-fields"             
 [3] "/questions/71387596/animating-the-mandelbrot-set"                           
 [4] "/questions/71358405/repeat-a-set-of-ids-for-every-n-rows"                   
 [5] "/questions/71344822/time-series-graphs-with-different-symbols"              
 [6] "/questions/71341865/creating-a-data-frame-with-commas"                      
 [7] "/questions/71287944/converting-igraph-to-visnetwork"                        
 [8] "/questions/71282863/fixing-the-first-and-last-numbers-in-a-random-list"     
 [9] "/questions/71282403/adding-labels-to-graph-nodes"                           
[10] "/questions/71262761/understanding-list-and-do-call-commands"                
[11] "/questions/71261431/adjusting-graph-layouts"                                
[12] "/questions/71255038/overriding-non-existent-components-in-a-loop"           
[13] "/questions/71244872/fixing-cluttered-titles-on-graphs"                      
[14] "/questions/71243676/directly-adding-titles-and-labels-to-visnetwork"        
[15] "/questions/71232353/removing-all-edges-in-igraph"                           
[16] "/questions/71230273/writing-a-function-that-references-elements-in-a-matrix"
[17] "/questions/71227260/generating-random-graphs-according-to-some-conditions"  
[18] "/questions/71087349/adding-combinations-of-numbers-in-a-matrix"   
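
Since the goal was one link per question, the per-page results can be flattened and turned into full URLs. The sketch below uses a small hand-written stand-in for out_lst (a subset of the paths above, with one duplicate added on purpose) and assumes https://stackoverflow.com as the site root.

```r
# Stand-in for the scraped result: one character vector per page
out_lst <- list(
  c("/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list",
    "/questions/72843570/combing-two-selections-together"),
  c("/questions/72566170/making-a-race-between-two-variables",
    "/questions/72843570/combing-two-selections-together")  # duplicate on purpose
)

# Flatten the per-page vectors into one vector and drop duplicates
all_paths <- unique(unlist(out_lst))

# Prefix the assumed site root to turn relative paths into full links
full_links <- paste0("https://stackoverflow.com", all_paths)
```

The result is a plain character vector, so it can be written out with writeLines(full_links, "questions.txt") or similar.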