Extract text from a specific pattern in text file using python

Time:06-03

I have a text file from which I am trying to extract titles into an Excel column. The titles I need are framed by a specific pattern:

COM *******************
COM * Title 1*
COM *******************

COM ***************************
COM * Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
COM ***************************
COM * Sub 2 *
COM ***************************
{
...TEXT DETAILS...
}


COM *******************
COM * Title 2*
COM *******************

COM ***************************
COM * T2 Sub 1  *
COM ***************************
{
...TEXT DETAILS...
}
COM ***************************
COM * T2 Sub 2 *
COM ***************************
{
...TEXT DETAILS...
}

The required output is a list of the extracted titles:

['Title 1', 'Sub 1',..,'T2 Sub 2']

or an Excel column such as:

CATEGORY
Title 1
Sub 1
Sub 2

Title 2
T2 Sub 1
T2 Sub 2

What I am unable to implement is matching the 'COM *****' border lines and pulling the title out of the line between them. I recently extracted strings based on a pattern that I think was similar to my current problem.

In that case the input text file was in this format:

CTG 'GEN:LT'                               
{
TEXT DETAILS....
}

CTG 'GEN:FR'                               
{
TEXT DETAILS....
}

CTG 'GEN:G_L02'                                
{
TEXT DETAILS....
}

CTG 'GEN:ER'                               
{
TEXT DETAILS....
}

CTG 'GEN:C1' 
{
TEXT DETAILS....
}

My goal was to extract the string in single quotes after CTG. My idea was to detect the CTG string and print the string next to it. Here is how I implemented it:

import re

def getCtgName(text):
    # Non-greedy match of the text between single quotes.
    matches = re.findall(r"'(.+?)'", text)
    return matches

mylines = []                                # Declare an empty list.
with open('filepath.txt', 'rt') as myfile:  # Open .txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

columns = []
substr = "CTG"                  # substring to search for.
for line in mylines:            # string to be searched
    if substr in line:
        columns.append(getCtgName(line)[0])
print(columns)

And got the output as:

['GEN:LT', 'GEN:FR',..., 'GEN:C1']

I believe similar logic can be implemented for the title extraction between those comment (COM ****) lines. Any help with the code, logic, or resources would be appreciated. Thank you!

CodePudding user response:

I think you can simplify this code into one regex pattern, using lookbehind and lookahead. These two techniques allow you to specify a certain part that has to come before or after the match, but which aren't included in the match itself. The syntax is (?<=text) for lookbehind and (?=text) for lookahead.
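
To make the lookaround idea concrete, here is a minimal sketch on an unrelated, made-up string (the `sample` text and the bracket delimiters are assumptions chosen only for illustration). The brackets must be present around each match, but they are not returned:

```python
import re

# Hypothetical sample string, used only to illustrate lookarounds.
sample = "price: [42] qty: [7]"

# (?<=\[) requires a "[" just before the match; (?=\]) requires a "]" just after.
# Neither bracket is included in the returned text.
inside_brackets = re.findall(r"(?<=\[)\d+(?=\])", sample)
print(inside_brackets)  # ['42', '7']
```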

So, the part that comes before a title is COM ***************************\nCOM * and the part that comes after it is *\nCOM ***************************. In regex syntax, the pattern is:
(?<=COM \*{27}\nCOM \*)[^\n]+(?=\*\nCOM \*{27})

In python code, that becomes:

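import re

with open('filepath.txt', 'rt') as myfile:
    txt = myfile.read()

# Match the text between "COM *" and "*" when both neighbouring lines are 27-asterisk borders.
pattern = r"(?<=COM \*{27}\nCOM \*)[^\n]+(?=\*\nCOM \*{27})"
titles = re.findall(pattern, txt)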
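
One caveat: the `\*{27}` count only matches headings framed by exactly 27 asterisks, and the top-level titles in the question use a shorter border. Python's `re` module needs a fixed-width lookbehind, so the count cannot simply become `\*+` there. A possible workaround, sketched below, swaps the lookarounds for a capturing group with `re.MULTILINE`. The sample text and its border lengths (19 and 27) are assumptions for illustration, not taken from the real file.

```python
import re

# Sample text for illustration; border lengths of 19 and 27 are assumptions.
txt = "\n".join([
    "COM " + "*" * 19,
    "COM * Title 1*",
    "COM " + "*" * 19,
    "",
    "COM " + "*" * 27,
    "COM * Sub 1 *",
    "COM " + "*" * 27,
    "{",
    "...TEXT DETAILS...",
    "}",
])

# A border line, then "COM *<title>*", then another border line.
# The capture group keeps only the title; \*+ accepts any border length.
pattern = re.compile(
    r"^COM \*+[ \t]*\nCOM \*([^*\n]+)\*[ \t]*\nCOM \*+[ \t]*$",
    re.MULTILINE,
)
titles = [t.strip() for t in pattern.findall(txt)]
print(titles)  # ['Title 1', 'Sub 1']
```

The `strip()` call removes the spaces that sit between the asterisks and the title text.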

Another way of doing this would be to reuse your code to collect the text after COM on each line, and then drop the border lines that consist only of asterisks.

An implementation:

import re

def getCtgName(text):
    # Capture everything after the "COM " prefix.
    matches = re.findall(r"COM (.+)", text)
    return matches

mylines = []                                # Declare an empty list.
with open('filepath.txt', 'rt') as myfile:  # Open .txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

titles = []
substr = "COM"                  # substring to search for.
for line in mylines:            # string to be searched
    if substr in line:
        found = getCtgName(line)
        if found:
            titles.append(found[0])

# Drop the border lines, then trim the framing asterisks and spaces.
titles = [t.strip("* ") for t in titles if t.strip("* ")]

print(titles)

CodePudding user response:

Simply use the following regex in place of the one in your getCtgName function, assuming that the titles and subtitles never contain a * character:

matches = re.findall(r"COM\s*\*([^*]+)", text)
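
A rough sketch of how that regex might slot into the loop from the question. For border lines the regex finds nothing, so `findall` returns an empty list and indexing `[0]` directly would raise an `IndexError`; the guard below is an addition to avoid that. The `mylines` sample and its border length are assumptions for illustration.

```python
import re

def getCtgName(text):
    # Text after "COM *" up to the next asterisk; empty list for border lines.
    return re.findall(r"COM\s*\*([^*]+)", text)

# Sample lines for illustration (the border length of 27 is an assumption).
mylines = [
    "COM " + "*" * 27,
    "COM * Sub 1 *",
    "COM " + "*" * 27,
    "{",
    "...TEXT DETAILS...",
    "}",
]

columns = []
for line in mylines:
    found = getCtgName(line)
    if found:                            # skip lines that carry no title
        columns.append(found[0].strip())
print(columns)  # ['Sub 1']
```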

CodePudding user response:

I am assuming that titles won't contain * characters.

import re

headings = []

# Assuming that each line from the text file is already read and stored in a list named 'strings'
for string in strings:
    sub_string = re.search(r'COM \*([^*]+)\*', string)
    if sub_string:
        headings.append(sub_string.group(1).strip())

Input:

strings = [
    'COM *******************',
    'COM * Title 1*',
    'COM *******************',
    'COM ***************************',
    'COM * Sub 1 *',
    'COM ***************************',
    '{',
    '...TEXT DETAILS...',
    '}',
    'COM ***************************',
    'COM * Sub 2 *',
    'COM ***************************',
    '{',
    '...TEXT DETAILS...',
    '}',
    'COM *******************',
    'COM * Title 2*',
    'COM *******************',
    'COM ***************************',
    'COM * T2 Sub 1  *',
    'COM ***************************',
    '{',
    '...TEXT DETAILS...',
    '}',
    'COM ***************************',
    'COM * T2 Sub 2 *',
    'COM ***************************',
    '{',
    '...TEXT DETAILS...',
    '}',
]

Output:

['Title 1', 'Sub 1', 'Sub 2', 'Title 2', 'T2 Sub 1', 'T2 Sub 2']
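
Since the question asks for an Excel column headed CATEGORY, one way to get there is to write the list as a one-column CSV, which Excel can open. This is only a sketch using the standard `csv` module rather than a real .xlsx writer; the helper name `write_category_column` and the file name in the comment are hypothetical.

```python
import csv
import io

headings = ['Title 1', 'Sub 1', 'Sub 2', 'Title 2', 'T2 Sub 1', 'T2 Sub 2']

def write_category_column(names, out):
    # One-column CSV with a CATEGORY header, one heading per row.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["CATEGORY"])
    for name in names:
        writer.writerow([name])

# In-memory buffer for demonstration; to write a file instead, pass
# open('titles.csv', 'w', newline='') (hypothetical file name).
buffer = io.StringIO()
write_category_column(headings, buffer)
print(buffer.getvalue())
```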