extracting data from the file read not working-CodePudding

I have a midx file that has a format the same as an XML file.

I am reading this file as:

f = open('visus8.midx', 'r')
regexCommand = re.compile("/\/(\S\w*)*.JPG/im")
for line in f:
    matches = regexCommand.findall(str(line))
    print(matches)

The file has

<dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" color="#f4be4bff" quad="0 365.189 5614.56 5.9402e-14 5617.89 3728.26 331.293 3920.91" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175 -0.998922 0.024509 -0.015672" t="-2.536858 -5.009510 91.514963" lat="35.944029617344619" lon="-90.450638476132283" alt="91.672617112396139" />

as one of the tags and I want to extract

/home/siddharth/Desktop/testing/DJI_0726.JPG

from filenames = ""

I am not able to do that can you please where my regex is wrong or something else is wrong !!

This is hald of the midx file that I am sharing here:

<dataset typename="IdxMultipleDataset" logic_box="0 7252 0 8683" physic_box="0.24874641550219023 0.24875126191231167 0.6071205757248886 0.6071264043899676">
    <slam width="5472" height="3648" dtype="uint8[3]" calibration="4256.023438 2735.799316 1824.087646" />
    <field name='voronoi'><code>
        output=voronoi()</code>
    </field>
    <translate x="0.24874641550219023" y="0.60712057572488864">
        <scale x="6.6824682454607912e-10" y="6.6824682454607912e-10">
            <translate x="-0" y="-5.9402018165207208e-14">
                <svg width="1048" height="1254" viewBox="0 0 7252 8683">
                    <g stroke="#000000" stroke-width="1" fill="#ffff00" fill-opacity="0.3">
                        <poi point="2710.006104,2372.072998" />
                        <poi point="2795.450439,3354.056396" />
                        <poi point="2846.955566,4015.307861" />
                        <poi point="2914.414307,4897.018555" />
                        <poi point="3015.048584,6234.411133" />
                        <poi point="4570.675293,6449.748047" />
                        <poi point="4437.736328,4984.978027" />
                        <poi point="4387.470703,4050.677002" />
                    </g>
                </svg>
                <dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" color="#f4be4bff" quad="0 365.189 5614.56 5.9402e-14 5617.89 3728.26 331.293 3920.91" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175 -0.998922 0.024509 -0.015672" t="-2.536858 -5.009510 91.514963" lat="35.944029617344619" lon="-90.450638476132283" alt="91.672617112396139" />

Thank you

CodePudding user response：

You could use a capture group and make the pattern a bit more specific without using repeated groups at all:

<dataset\b[^<>]* filenames="(\S \.JPG)"

Regex demo

Example

import re

pattern = r'<dataset\b[^<>]* filenames="(\S \.JPG)"'
s = "...."
print(re.findall(pattern, s))

Output

['/home/siddharth/Desktop/testing/DJI_0726.JPG']

CodePudding user response：

There are multiple problems with your regex pattern.

In Pythons re.compile(pattern, flags=...) you specify the regex flags as parameter and don't put them into the regex pattern. The '/' and '/im' in "/\/(\S\w*)*.JPG/im" are in Python interpreted as part of the regex pattern so the regex tries to find "JPG/im" literally and fails.

In a regex pattern the '.' has a special meaning (any single character), so you need to escape it to match the dot as such.

There is no need to put a backslash before the literal slash.

You wanted to capture repeated occurrences of slash followed by non-whitespace character so you have to put the slash inside the capturing group.

If you adjust your regex pattern according to what is stated above you will arrive at:

regexCommand = re.compile("(/\S*)*\.JPG", re.I|re.M)

what then will give you a result (not including '.JPG' as it is not included in the capturing group).

So if you want to extract any path to JPG-images including .JPG you can use:

regexCommand = re.compile("/\S*\.JPG", flags=re.I|re.M)

Or as suggested in the other answer a more specific regex.