How to add missing rows (URLs containing page numbers) to an array (like seq in Linux)


I have an array consisting of URLs of the form:

$URLs = @("https://somesite.com/folder1/page/1/"
,"https://somesite.com/folder222/page/1/"
,"https://somesite.com/folder222/page/2/"
,"https://somesite.com/folder444/page/1/"
,"https://somesite.com/folder444/page/3/"
,"https://somesite.com/folderBBB/page/1/"
,"https://somesite.com/folderBBB/page/5/")

They always have /page/1/; I need to add (or reconstruct) all the missing URLs between 1 and the highest page number, so the array ends up like this:

$URLs = @("https://somesite.com/folder1/page/1/"
,"https://somesite.com/folder222/page/1/"
,"https://somesite.com/folder222/page/2/"
,"https://somesite.com/folder444/page/1/"
,"https://somesite.com/folder444/page/2/"
,"https://somesite.com/folder444/page/3/"
,"https://somesite.com/folderBBB/page/1/"
,"https://somesite.com/folderBBB/page/2/"
,"https://somesite.com/folderBBB/page/3/"
,"https://somesite.com/folderBBB/page/4/"
,"https://somesite.com/folderBBB/page/5/")

I'd imagine the pseudo-code would be something like:

  • For each folder, extract the highest page number:

    https://somesite.com/folderBBB/page/5/

  • Expand this out from (1) to (5):

    https://somesite.com/folderBBB/page/1/
    https://somesite.com/folderBBB/page/2/
    https://somesite.com/folderBBB/page/3/
    https://somesite.com/folderBBB/page/4/
    https://somesite.com/folderBBB/page/5/

  • Output this into an array

Any pointers would be welcome!

CodePudding user response:

You can use a pipeline-based solution via the Group-Object cmdlet as follows:

$URLs = @("https://somesite.com/folder1/page/1/"
  , "https://somesite.com/folder222/page/1/"
  , "https://somesite.com/folder222/page/2/"
  , "https://somesite.com/folder444/page/1/"
  , "https://somesite.com/folder444/page/3/"
  , "https://somesite.com/folderBBB/page/1/"
  , "https://somesite.com/folderBBB/page/5/")

$URLs |
  Group-Object { $_ -replace '[^/]+/$' } | # Group by shared prefix
    ForEach-Object {
      # Extract the start and end number for the group at hand.
      [int] $from, [int] $to =
        ($_.Group[0], $_.Group[-1]) -replace '^.+/([^/]+)/$', '$1'
      # Generate the output URLs.
      # You can assign the entire pipeline to a variable 
      # ($generatedUrls = $URLs | ...) to capture them in an array.
      foreach ($i in $from..$to) { $_.Name + $i + '/' }
    }
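
To see what the grouping stage produces for the sample input, before the ranges are expanded, you can run just that stage by itself (output shown as comments; exact column formatting may vary):

$URLs |
  Group-Object { $_ -replace '[^/]+/$' } |
    Select-Object Count, Name
# Count Name
# ----- ----
#     1 https://somesite.com/folder1/page/
#     2 https://somesite.com/folder222/page/
#     2 https://somesite.com/folder444/page/
#     2 https://somesite.com/folderBBB/page/

Each group's Name is the shared prefix, and $from..$to in the ForEach-Object block then uses PowerShell's range operator (the counterpart of seq mentioned in the title) to enumerate the page numbers for that prefix.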

Note:

  • The assumption is that the first and last element in each group of URLs that share the same prefix always contain the start and end point of the desired enumeration, respectively.

    • If that assumption doesn't hold, use the following instead (a sketch of where this slots into the full pipeline follows after this list):

      $minMax = $_.Group -replace '^.+/([^/]+)/$', '$1' |
                  Measure-Object -Minimum -Maximum
      $from, $to = $minMax.Minimum, $minMax.Maximum
      
  • The regex-based -replace operator is used for two things (illustrated with a quick example after this list):

    • -replace '[^/]+/$' eliminates the last component from each URL, so as to group them by their shared prefix.

    • -replace '^.+/([^/]+)/$', '$1' effectively extracts the last component from each given URL, i.e. the numbers that represent the start and end point of the desired enumeration.
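
As a quick illustration of the two replacements, applied to one of the sample URLs (results shown as comments):

'https://somesite.com/folderBBB/page/5/' -replace '[^/]+/$'
# -> https://somesite.com/folderBBB/page/

'https://somesite.com/folderBBB/page/5/' -replace '^.+/([^/]+)/$', '$1'
# -> 5

And here is a sketch of how the order-independent variant from the first note might slot into the full pipeline. The [int[]] cast is an addition in this sketch, so that the minimum/maximum comparison is numeric rather than lexical (which matters once page numbers reach 10):

$URLs |
  Group-Object { $_ -replace '[^/]+/$' } |
    ForEach-Object {
      # Derive the range from the numerically smallest and largest
      # page numbers in the group, regardless of input order.
      $minMax = [int[]] ($_.Group -replace '^.+/([^/]+)/$', '$1') |
                  Measure-Object -Minimum -Maximum
      $from, $to = $minMax.Minimum, $minMax.Maximum
      foreach ($i in $from..$to) { $_.Name + $i + '/' }
    }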


Procedural alternative:

# Build a map (ordered hashtable) that maps URL prefixes
# to the number suffixes that occur among the URLs sharing
# the same prefix.
$map = [ordered] @{}
foreach ($url in $URLs) {
  if ($url -match '^(.+)/([^/]+)/') {
    $prefix, [int] $num = $Matches[1], $Matches[2]
    $map[$prefix] = [array] $map[$prefix] + $num
  }
}

# Process the map to generate the URLs.
# Again, use something like
#    $generatedUrls = foreach ...
# to capture them in an array.
foreach ($prefix in $map.Keys) {
  $nums = $map[$prefix]
  $from, $to = $nums[0], $nums[-1]
  foreach ($num in $from..$to) {
    '{0}/{1}/' -f $prefix, $num  # synthesize URL and output it.
  }
}
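
To capture the synthesized URLs in an array, as the comment above suggests, the second loop might be wrapped like this (the $generatedUrls variable name is just illustrative):

# Collect the output of the foreach loop in an array.
$generatedUrls = foreach ($prefix in $map.Keys) {
  $nums = $map[$prefix]
  $from, $to = $nums[0], $nums[-1]
  foreach ($num in $from..$to) {
    '{0}/{1}/' -f $prefix, $num
  }
}

$generatedUrls.Count  # 11 for the sample input above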