I have a midx file that has a format the same as an XML file.
I am reading this file as:
f = open('visus8.midx', 'r')
regexCommand = re.compile("/\/(\S\w*)*.JPG/im")
for line in f:
matches = regexCommand.findall(str(line))
print(matches)
The file has
<dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" color="#f4be4bff" quad="0 365.189 5614.56 5.9402e-14 5617.89 3728.26 331.293 3920.91" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175 -0.998922 0.024509 -0.015672" t="-2.536858 -5.009510 91.514963" lat="35.944029617344619" lon="-90.450638476132283" alt="91.672617112396139" />
as one of the tags and I want to extract
/home/siddharth/Desktop/testing/DJI_0726.JPG
from filenames = ""
I am not able to do that can you please where my regex is wrong or something else is wrong !!
This is hald of the midx file that I am sharing here:
<dataset typename="IdxMultipleDataset" logic_box="0 7252 0 8683" physic_box="0.24874641550219023 0.24875126191231167 0.6071205757248886 0.6071264043899676">
<slam width="5472" height="3648" dtype="uint8[3]" calibration="4256.023438 2735.799316 1824.087646" />
<field name='voronoi'><code>
output=voronoi()</code>
</field>
<translate x="0.24874641550219023" y="0.60712057572488864">
<scale x="6.6824682454607912e-10" y="6.6824682454607912e-10">
<translate x="-0" y="-5.9402018165207208e-14">
<svg width="1048" height="1254" viewBox="0 0 7252 8683">
<g stroke="#000000" stroke-width="1" fill="#ffff00" fill-opacity="0.3">
<poi point="2710.006104,2372.072998" />
<poi point="2795.450439,3354.056396" />
<poi point="2846.955566,4015.307861" />
<poi point="2914.414307,4897.018555" />
<poi point="3015.048584,6234.411133" />
<poi point="4570.675293,6449.748047" />
<poi point="4437.736328,4984.978027" />
<poi point="4387.470703,4050.677002" />
</g>
</svg>
<dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" color="#f4be4bff" quad="0 365.189 5614.56 5.9402e-14 5617.89 3728.26 331.293 3920.91" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175 -0.998922 0.024509 -0.015672" t="-2.536858 -5.009510 91.514963" lat="35.944029617344619" lon="-90.450638476132283" alt="91.672617112396139" />
Thank you
CodePudding user response:
You could use a capture group and make the pattern a bit more specific without using repeated groups at all:
<dataset\b[^<>]* filenames="(\S \.JPG)"
Example
import re
pattern = r'<dataset\b[^<>]* filenames="(\S \.JPG)"'
s = "...."
print(re.findall(pattern, s))
Output
['/home/siddharth/Desktop/testing/DJI_0726.JPG']
CodePudding user response:
There are multiple problems with your regex pattern.
In Pythons re.compile(pattern, flags=...)
you specify the regex flags as parameter and don't put them into the regex pattern. The '/'
and '/im'
in "/\/(\S\w*)*.JPG/im"
are in Python interpreted as part of the regex pattern so the regex tries to find "JPG/im"
literally and fails.
In a regex pattern the '.'
has a special meaning (any single character), so you need to escape it to match the dot as such.
There is no need to put a backslash before the literal slash.
You wanted to capture repeated occurrences of slash followed by non-whitespace character so you have to put the slash inside the capturing group.
If you adjust your regex pattern according to what is stated above you will arrive at:
regexCommand = re.compile("(/\S*)*\.JPG", re.I|re.M)
what then will give you a result (not including '.JPG' as it is not included in the capturing group).
So if you want to extract any path to JPG-images including .JPG
you can use:
regexCommand = re.compile("/\S*\.JPG", flags=re.I|re.M)
Or as suggested in the other answer a more specific regex.