Read the text of a file between 2 words in Python

Time:06-04

I am trying to open and read an .xml file, extract the fragment that lies between two words (the opening and closing 'profile' tags, both included), and write only that fragment to a new .xml file that I generate. The fragment is located by a keyword that I supply.

My current Python script opens and reads the source .xml file, searches the text for the keyword, and writes every complete line that contains it to a new .xml file, as follows:

keyword = 'Georgia'
occurrences = []

# Collect every line that contains the keyword
with open('test_input.xml') as lines:
    for line in lines:
        if keyword in line:
            occurrences.append(line)

# Write the matching lines to the output file
with open('test_output.xml', 'w') as archi1:
    archi1.write(''.join(occurrences))

The result I get is a "test_output.xml" file that contains the following:

     <id>Georgia-1</id>
         <profile>Georgia-p1</profile>
     <id>Georgia-2</id>
         <profile>Georgia-p2</profile>

The problem is that I need more than the complete lines that contain the keyword (in this case 'Georgia'). I need the entire fragment that contains those lines, delimited by the opening and closing 'profile' tags. That is, I need the following result:

<profile>
    <id>Georgia-1</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>Georgia-p1</profile>
        <showtitle>Georgia_s1</showtitle>
        <ip>000.000.0.3</ip>
        <port>00003</port>
        <persistencePort>00033</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_3</webstart.server.name>
        <codebaseProtocolServer>T3</codebaseProtocolServer>
    </properties>
</profile>
<profile>
    <id>Georgia-2</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>Georgia-p2</profile>
        <showtitle>Georgia_s2</showtitle>
        <ip>000.000.0.4</ip>
        <port>00004</port>
        <persistencePort>00044</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_4</webstart.server.name>
        <codebaseProtocolServer>T4</codebaseProtocolServer>
    </properties>
</profile>

The full source .xml I am using is as follows:

<project>       

    
<profile>
    <id>Azerbaiyan-1</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>Azerbaiyan-p1</profile>
        <showtitle>Azerbaiyan_s1</showtitle>
        <ip>000.000.0.1</ip>
        <port>00001</port>
        <persistencePort>00011</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_1</webstart.server.name>
        <codebaseProtocolServer>T1</codebaseProtocolServer>
    </properties>
</profile>

<profile>
    <id>Azerbaiyan-2</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>Azerbaiyan-p2</profile>
        <showtitle>Azerbaiyan_s2</showtitle>
        <ip>000.000.0.2</ip>
        <port>00002</port>
        <persistencePort>00022</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_2</webstart.server.name>
        <codebaseProtocolServer>T2</codebaseProtocolServer>
    </properties>
</profile>

<profile>
    <id>Georgia-1</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>Georgia-p1</profile>
        <showtitle>Georgia_s1</showtitle>
        <ip>000.000.0.3</ip>
        <port>00003</port>
        <persistencePort>00033</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_3</webstart.server.name>
        <codebaseProtocolServer>T3</codebaseProtocolServer>
    </properties>
</profile>
<profile>
    <id>Georgia-2</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>Georgia-p2</profile>
        <showtitle>Georgia_s2</showtitle>
        <ip>000.000.0.4</ip>
        <port>00004</port>
        <persistencePort>00044</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_4</webstart.server.name>
        <codebaseProtocolServer>T4</codebaseProtocolServer>
    </properties>
</profile>

<profile>
    <id>USA-1</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>USA-p1</profile>
        <showtitle>USA1_s1</showtitle>
        <ip>000.000.0.5</ip>
        <port>00005</port>
        <persistencePort>00055</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_5</webstart.server.name>
        <codebaseProtocolServer>T5</codebaseProtocolServer>
    </properties>
</profile>

<profile>
    <id>USA-2</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <properties>
        <profile>USA-p2</profile>
        <showtitle>USA1_s2</showtitle>
        <ip>000.000.0.6</ip>
        <port>00006</port>
        <persistencePort>00066</persistencePort>
        <defaultLocale>en_GB</defaultLocale>
        <webstart.server.name>host_6</webstart.server.name>
        <codebaseProtocolServer>T6</codebaseProtocolServer>
    </properties>
</profile>

</project>

CodePudding user response:

Parse the input as XML and capture the profile elements that have an id child element whose text value contains the string "Georgia".

The following program uses the standard library's ElementTree module and prints the desired result:

import xml.etree.ElementTree as ET

tree = ET.parse("input.xml")

# Iterate over the top-level 'profile' elements (direct children of the root)
for profile in tree.findall("profile"):
    # Named profile_id to avoid shadowing the built-in id()
    profile_id = profile.find("id").text
    if "Georgia" in profile_id:
        print(ET.tostring(profile).decode())
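
One detail worth spelling out: findall("profile") only looks at direct children of the root element, so the nested <profile> inside <properties> is not treated as a separate match. Below is a minimal sketch of that behaviour, using a shortened inline sample instead of the real file. The SAMPLE string and the matching_profiles helper are assumptions made for illustration, not part of the original script.

```python
import xml.etree.ElementTree as ET

# Shortened sample, assumed for illustration; the real file has more fields.
SAMPLE = """<project>
<profile>
    <id>Azerbaiyan-1</id>
    <properties>
        <profile>Azerbaiyan-p1</profile>
    </properties>
</profile>
<profile>
    <id>Georgia-1</id>
    <properties>
        <profile>Georgia-p1</profile>
    </properties>
</profile>
</project>"""


def matching_profiles(root, keyword):
    """Return top-level <profile> elements whose <id> text contains keyword."""
    matches = []
    # findall("profile") only visits direct children of root,
    # so the nested <profile> under <properties> is skipped.
    for profile in root.findall("profile"):
        profile_id = profile.findtext("id", default="")
        if keyword in profile_id:
            matches.append(profile)
    return matches


root = ET.fromstring(SAMPLE)
found = matching_profiles(root, "Georgia")
print(len(found))
print(found[0].find("id").text)
```

Using findtext with a default also means a profile without an <id> child is skipped rather than raising an AttributeError.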

CodePudding user response:

I have now adapted the suggested code so that it writes the result to a file. It looks like this:

import xml.etree.ElementTree as ET

tree = ET.parse("input.xml")

for profile in tree.findall("profile"):
    id = profile.find("id").text
    if "Georgia" in id:
        # Write each matching profile to the output file
        archi1 = open("output.xml", "a")
        archi1.write(ET.tostring(profile).decode())
        archi1.close()

This gives me an output.xml file with the following content:

<profile>
 <id>Georgia-1</id>
 <activation>
    <activeByDefault>false</activeByDefault>
</activation>
<properties>
    <profile>Georgia-p1</profile>
    <showtitle>Georgia_s1</showtitle>
    <ip>000.000.0.3</ip>
    <port>00003</port>
    <persistencePort>00033</persistencePort>
    <defaultLocale>en_GB</defaultLocale>
    <webstart.server.name>host_3</webstart.server.name>
    <codebaseProtocolServer>T3</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-2</id>
<activation>
    <activeByDefault>false</activeByDefault>
</activation>
<properties>
    <profile>Georgia-p2</profile>
    <showtitle>Georgia_s2</showtitle>
    <ip>000.000.0.4</ip>
    <port>00004</port>
    <persistencePort>00044</persistencePort>
    <defaultLocale>en_GB</defaultLocale>
    <webstart.server.name>host_4</webstart.server.name>
    <codebaseProtocolServer>T4</codebaseProtocolServer>
 </properties>
</profile>

This is the expected result! But I must be doing something wrong when writing the file, because when I change the keyword to something else, such as "USA", and write to the same output file, the file is not overwritten. Instead, I get this:

<profile>
<id>Georgia-1</id>
<activation>
    <activeByDefault>false</activeByDefault>
</activation>
<properties>
    <profile>Georgia-p1</profile>
    <showtitle>Georgia_s1</showtitle>
    <ip>000.000.0.3</ip>
    <port>00003</port>
    <persistencePort>00033</persistencePort>
    <defaultLocale>en_GB</defaultLocale>
    <webstart.server.name>host_3</webstart.server.name>
    <codebaseProtocolServer>T3</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-2</id>
<activation>
    <activeByDefault>false</activeByDefault>
</activation>
<properties>
    <profile>Georgia-p2</profile>
    <showtitle>Georgia_s2</showtitle>
    <ip>000.000.0.4</ip>
    <port>00004</port>
    <persistencePort>00044</persistencePort>
    <defaultLocale>en_GB</defaultLocale>
    <webstart.server.name>host_4</webstart.server.name>
    <codebaseProtocolServer>T4</codebaseProtocolServer>
</properties>
</profile>

<profile>
<id>USA-1</id>
<activation>
    <activeByDefault>false</activeByDefault>
</activation>
<properties>
    <profile>USA-p1</profile>
    <showtitle>USA1_s1</showtitle>
    <ip>000.000.0.5</ip>
    <port>00005</port>
    <persistencePort>00055</persistencePort>
    <defaultLocale>en_GB</defaultLocale>
    <webstart.server.name>host_5</webstart.server.name>
    <codebaseProtocolServer>T5</codebaseProtocolServer>
</properties>
</profile>

<profile>
<id>USA-2</id>
<activation>
    <activeByDefault>false</activeByDefault>
</activation>
<properties>
    <profile>USA-p2</profile>
    <showtitle>USA1_s2</showtitle>
    <ip>000.000.0.6</ip>
    <port>00006</port>
    <persistencePort>00066</persistencePort>
    <defaultLocale>en_GB</defaultLocale>
    <webstart.server.name>host_6</webstart.server.name>
    <codebaseProtocolServer>T6</codebaseProtocolServer>
</properties>
</profile>

What am I doing wrong, or what am I missing, so that the new result overwrites the previous one instead of being added to it?
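
The likely cause is the "a" (append) mode in open("output.xml", "a"): it adds to whatever the file already contains, including content left over from earlier runs. Here is a minimal sketch of one way around it, assuming the same file layout as above: open the output once in "w" mode before the loop, which truncates the file, then write each match inside the loop. The write_profiles helper and its parameter names are hypothetical, not part of the original script.

```python
import xml.etree.ElementTree as ET


def write_profiles(input_path, output_path, keyword):
    """Write every top-level <profile> whose <id> contains keyword.

    Returns the number of profiles written.
    """
    tree = ET.parse(input_path)
    count = 0
    # "w" truncates the file once, so each run replaces the old content.
    # Opening inside the loop with "w" would keep only the last match.
    with open(output_path, "w", encoding="utf-8") as out:
        for profile in tree.findall("profile"):
            profile_id = profile.findtext("id", default="")
            if keyword in profile_id:
                out.write(ET.tostring(profile, encoding="unicode"))
                count += 1
    return count
```

Calling write_profiles("input.xml", "output.xml", "USA") after a run with "Georgia" would then leave only the USA profiles in the file. Note that the output holds several <profile> elements with no single root, so it is an XML fragment rather than a well-formed document.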
