grep XML value and export it if it contains number

Time:09-12

I have an XML file that contains links to products and categories. Each category link ends with a word and a slash, like https://url.com/category/subcategory/, and every link is enclosed in <loc> </loc>.

Each product link, however, ends with a 6-digit number, for example https://url.com/category/subcategory/product-name-of-something-154555

I want to grep this while fetching the file with wget, but for now I am experimenting only with the grep part; I already know how to download and open the file.

This is the command I have been running, but it exports all links, including the categories.

grep -Po "(?<=<loc>)(.*)[0-9]{6}/(?=</loc>)" nameofmyfile.xml

I did manage to grep each 6-digit code with this command:

grep -oP "(?<=<loc>)*[0-9]{6}/(?=</loc>)" nameofmyfile.xml

but I also need the part of the link in front of the number, because running this only gives me: 666444/

The file structure is this:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/paint/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/screws-and-nails/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/power-toools/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/bathroom/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>

How can I extract all links that end in -XXXXXX/ and skip the others? They are all inside <loc> </loc>.

CodePudding user response:

If you want to use grep to get the numbers only:

<loc>[^>]*\K\d{6}(?=/</loc>)

Explanation

  • <loc> Match literally
  • [^>]* Match zero or more characters other than >
  • \K Forget what is matched so far
  • \d{6} Match 6 digits
  • (?=/</loc>) Positive lookahead, assert /</loc> to the right

Example

grep -Po "<loc>[^>]*\K\d{6}(?=/</loc>)" nameofmyfile.xml

Output

145890
145489
145488
010274
010272
010273
168544
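
The question asks for the full links rather than only the numbers. A minimal sketch of one way to get them, assuming GNU grep built with PCRE support (-P): put \K right after <loc> so the whole URL is kept, and require a -, six digits and a / directly before </loc>. The helper name product_links and the sample data are only illustrative, not part of the answer above.

```shell
# Print every <loc> value that ends in -NNNNNN/ (assumes GNU grep with -P).
# product_links is a hypothetical helper name; it reads the given files,
# or standard input when no file is given.
product_links() {
  grep -Po '<loc>\K[^<]*-\d{6}/(?=</loc>)' "$@"
}

# Small sample in the same shape as the sitemap in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<url>
    <loc>https://somelink.com/category/building-materials/paint/</loc>
</url>
<url>
    <loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
</url>
<url>
    <loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
</url>
EOF

# Only the two product links are printed; the category link is skipped.
product_links "$sample"
```

Because the function also reads standard input, it should work at the end of a pipe, for example wget -qO- followed by the sitemap URL, piped into product_links and redirected into a file to export the links.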

CodePudding user response:

xmllint can be used after wget to get the links ending in -<6 digits>. The trick is to replace every digit with an underscore and then test for the resulting pattern:

cat tmp.xml | xmllint --xpath '//*[local-name()="loc" and contains(translate(.,"0123456789","__________"), "-______")]/text()' -

result

https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/

or, saving the wget output to a tmp file:

(echo "setrootns"; echo 'cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()') | xmllint  --shell tmp.xml | grep -v ' ----'

result

/ > setrootns
/ > cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/
/ > 
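
Note that contains() matches -______ anywhere in the link, so a category whose name happens to include a dash and six digits would also be returned. A possible refinement, sketched here under the assumption that xmllint (libxml2) is installed and that each <loc> value has no surrounding whitespace, compares only the last 8 characters with -______/ using the XPath 1.0 substring() and string-length() functions. The helper name product_links_xpath and the third sample link are made up for illustration.

```shell
# Hypothetical helper: print <loc> values whose last 8 characters are
# "-", six digits and "/". Assumes xmllint from libxml2 is available.
product_links_xpath() {
  xmllint --xpath '//*[local-name()="loc"][substring(translate(., "0123456789", "__________"), string-length(.) - 7) = "-______/"]/text()' "$1"
}

# Sample sitemap; the "pipes-123456-archive" link is an invented edge case
# that contains() would match but the end-anchored test should not.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://somelink.com/category/inside/bathroom/</loc></url>
    <url><loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc></url>
    <url><loc>https://somelink.com/category/inside/pipes-123456-archive/</loc></url>
</urlset>
EOF

product_links_xpath "$sample"
```

translate() keeps the string length unchanged, so string-length(.) - 7 is the position where the last 8 characters start.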

CodePudding user response:

With your shown samples, please try the following awk code, written and tested in GNU awk. In short, it sets RS (the record separator) to the regex (^|\n[[:space:]]+)<loc>[^<]*<\\/loc>\n, so that each matched <loc>...</loc> line is available in RT. The main block then splits RT on - and /, and checks whether the piece just before the closing </loc> (after the last - and before the final slash) consists of exactly 6 digits; if so, it prints that number.

awk -v RS='(^|\n[[:space:]]+)<loc>[^<]*<\\/loc>\n' '
RT{
  num=split(RT,arr,"[-/]")
  if(arr[num-2]~/^[0-9]{6}$/){
     print arr[num-2]
  }
}
'  Input_file

Here is the online demo for the regex used.

NOTE: On the regex demo site, the capturing group is changed to a non-capturing group and the / is not double-escaped, so that the regex is valid there; in GNU awk, use the regex exactly as written in the code above.
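
If the full link is wanted instead of only the number, a simpler variant is possible. The following is a minimal sketch, not the method above: it assumes each <loc>...</loc> pair sits on its own line, as in the shown sample, and uses the tags themselves as the field separator. The helper name product_links_awk is hypothetical, and the six digits are spelled out instead of using {6} so that the pattern does not depend on interval-expression support in a particular awk.

```shell
# Hypothetical helper: print the URL inside <loc>...</loc> when it ends in
# "-", six digits and "/". Reads the given files, or standard input.
product_links_awk() {
  awk -F'</?loc>' '
    # A line holding <loc>URL</loc> splits into: indentation, URL, empty tail.
    NF > 1 && $2 ~ /-[0-9][0-9][0-9][0-9][0-9][0-9]\/$/ { print $2 }
  ' "$@"
}

# Small sample in the same shape as the sitemap in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/</loc>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
    </url>
EOF

# Prints only the hammer-145488 product link.
product_links_awk "$sample"
```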