I have an XML file that contains links to products and categories. Each category link ends with a word and a slash, like https://url.com/category/subcategory/, and every link is enclosed in <loc> </loc>.
However, each product link ends with a 6-digit number, for example https://url.com/category/subcategory/product-name-of-something-154555
Eventually I want to grep this while fetching the file with wget, but for now I am experimenting only with the grep part; I already know how to download and open the file.
This is the code I have been running, but it exports all links, including the categories:
grep -Po "(?<=<loc>)(.*)[0-9]{6}/(?=</loc>)" nameofmyfile.xml
I did manage to grep each 6-digit code with this:
grep -oP "(?<=<loc>)*[0-9]{6}/(?=</loc>)" nameofmyfile.xml
but I also need the part of the link in front of the number, because I only get 666444/ when running this.
The file structure is this:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/paint/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/screws-and-nails/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/power-toools/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/bathroom/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
How can I extract all links inside <loc> </loc> that end in -XXXXXX/ and skip the others?
CodePudding user response:
If you want to use grep to get the numbers only:
<loc>[^>]*\K\d{6}(?=/</loc>)
Explanation
<loc>: match literally
[^>]*: optionally match any char except >
\K: forget what is matched so far
\d{6}: match 6 digits
(?=/</loc>): positive lookahead, assert /</loc> to the right
See a regex demo.
Example
grep -Po "<loc>[^>]*\K\d{6}(?=/</loc>)" nameofmyfile.xml
Output
145890
145489
145488
010274
010272
010273
168544
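Since the question asks for the whole link rather than just the number, the same idea can be turned around. The following is only a sketch, assuming GNU grep with -P support; the temporary file and the variable names are made up for the illustration. It keeps <loc> out of the match with a lookbehind and requires -XXXXXX/ directly before </loc>:

```shell
# Hypothetical sample file; only the <loc> lines matter here.
sitemap=$(mktemp)
cat > "$sitemap" <<'EOF'
<url>
<loc>https://somelink.com/category/building-materials/paint/</loc>
</url>
<url>
<loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
</url>
EOF

# (?<=<loc>) and (?=</loc>) keep the tags out of the output.
# [^<]* cannot run into the closing tag, and -[0-9]{6}/ must end the URL.
product_links=$(grep -Po '(?<=<loc>)[^<]*-[0-9]{6}/(?=</loc>)' "$sitemap")
printf '%s\n' "$product_links"

rm -f "$sitemap"
```

With this sample the category link is skipped and the two product links are printed in full.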
CodePudding user response:
xmllint can be used after wget to get the links ending in -<6 numbers>.
The trick is to replace every digit with an underscore and then check for the resulting pattern.
cat tmp.xml | xmllint --xpath '//*[local-name()="loc" and contains(translate(.,"0123456789","__________"), "-______")]/text()' -
result
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/
or, after saving the wget output to a tmp file:
(echo "setrootns"; echo 'cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()') | xmllint --shell tmp.xml | grep -v ' ----'
result
/ > setrootns
/ > cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/
/ >
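To see what the translate() call in the XPath expressions above does, the same masking can be tried in isolation with tr. This is only an illustration; the helper name mask is made up here. Note that the check is a "contains" test, so it would also accept a dash followed by more than six digits, or a dash and six digits that are not at the end of the link.

```shell
# Mask every digit with "_" so a run of six digits after "-" becomes "-______".
mask() { printf '%s\n' "$1" | tr '0123456789' '__________'; }

product=$(mask 'https://somelink.com/category/inside/pipes/draining-pipes-168544/')
category=$(mask 'https://somelink.com/category/inside/bathroom/')

printf '%s\n' "$product" "$category"

# grep -F treats the pattern as a fixed string; -e guards the leading dash.
printf '%s\n' "$product" | grep -qF -e '-______' && echo "product link"
printf '%s\n' "$category" | grep -qF -e '-______' || echo "category link"
```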
CodePudding user response:
With your shown samples, please try the following awk code, written and tested in GNU awk. In short: RS (the record separator) is set to the regex (^|\n[[:space:]]*)<loc>[^<]*<\\/loc>\n, so every <loc>.....</loc> line, with optional leading whitespace, is treated as a record terminator and ends up in RT. The main program splits RT on - and / and checks whether the part after the last - (just before the final slash and </loc>) consists of exactly 6 digits; if so, it prints that number.
awk -v RS='(^|\n[[:space:]]*)<loc>[^<]*<\\/loc>\n' '
RT{
num=split(RT,arr,"[-/]")
if(arr[num-2]~/^[0-9]{6}$/){
print arr[num-2]
}
}
' Input_file
Here is the online demo for the regex used.
NOTE: On the regex demo site, the capturing group is changed to a non-capturing group and the / is not double-escaped, to keep the pattern clear for that site; in GNU awk, use the regex exactly as it appears in the code above.
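If the full link is wanted instead of only the number, a simpler variant that does not rely on RT might look like the sketch below. It is an assumption-laden alternative, not the code above: it uses </?loc> as the field separator so that the URL becomes field 2, and it spells out the six digits so that it does not depend on interval support in the awk in use. The sample file and variable names are hypothetical.

```shell
# Hypothetical sample file; only the <loc> lines matter here.
sitemap=$(mktemp)
cat > "$sitemap" <<'EOF'
<url>
<loc>https://somelink.com/category/inside/bathroom/</loc>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
</url>
EOF

# Splitting on <loc> or </loc> puts the URL in $2; NF > 2 skips lines without both tags.
awk_links=$(awk -F'</?loc>' '
NF > 2 && $2 ~ /-[0-9][0-9][0-9][0-9][0-9][0-9]\/$/ { print $2 }
' "$sitemap")
printf '%s\n' "$awk_links"

rm -f "$sitemap"
```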