For loop writing rows into variables

I am trying to write a loop in Python that extracts information from each row of a file. The input rows look like this:

[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
##repost from january 13 , 2004 with a better fit title .
i/p button[ 2]##im a more happier person after discovering the i/p button ! 
dvd player[ 1][p]##it practically plays almost everything you give it . 
player[ 2],sound[-1]##i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .

I would like to separate each row into the sentence (the text after ##) and the tag (the text before ##), and collect the results for all rows in variables. The expected output:

Variable: title
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .

So the title should be kept until a row starting with a new [t] appears.

Variable: sentence_only

repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button ! 
it practically plays almost everything you give it . 
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .

Variable: tag


i/p button[ 2]
dvd player[ 1][p]
player[ 2],sound[-1]

My current code only keeps the last row in each variable instead of the full list.

Here is my attempt at solving this:

import re
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = "Data/Customer_review_data"

filelists = PlaintextCorpusReader(corpus_root, '.*')

filelists.fileids()

rawlist = filelists.raw('Apex AD2600 Progressive-scan DVD player.txt')

sentence = rawlist.split("\n")[:]

a_line = ""
sentence_only = ""
content = ""
title = ""
tag = ""

for b_line in sentence:
    if title != '' or content != '' or sentence_only != '':
        content = title, tag, sentence_only
    if re.match(r"^\*", b_line):
        continue
    if re.match(r"^\[t\][ ]", b_line):
        title = b_line[4:]
        continue
    if re.match(r"^\[t\]", b_line):
        title = b_line[3:]
        continue
    if re.match(r"^##", b_line):
        sentence_only = b_line[2:]
        continue
    if re.match(r".*##", b_line):
        i = len(b_line.split('##')[0]) + 2  # index of the first character after '##'
        sentence_only = b_line[i:]
        tag = b_line[:i-2]
        continue
    if re.match(r".*#", b_line):
        sentence_only = b_line[2:]
        continue
print(content)

CodePudding user response:

Actually, I reread your question, and it seems each file contains only one item. If that is the case, this becomes much easier:

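with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # splitlines() copes with both \n and \r\n line endings

item = {
    "title": data[0][4:],  # strip the leading "[t] "
    "contents": [
        # partition() splits at the first "##" and does not raise if it is missing
        {"tag": tag, "sentence": sentence}
        for tag, _, sentence in (line.partition("##") for line in data[1:])
    ],
}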

This results in a dict with the same structure as the items in the old answer below.
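
One caveat on splitting at ##: indexing the result of line.split("##") with [1] fails with an IndexError on a row that contains no ##, while str.partition always returns three parts. A minimal sketch of a safer split follows. The helper name split_tag_sentence is made up for illustration, and returning the whole row as the sentence when ## is missing is an assumption, not something the question specifies.

```python
def split_tag_sentence(line):
    """Split one row at the first '##' into (tag, sentence).

    Hypothetical helper, not part of the original answer.
    """
    tag, sep, sentence = line.partition("##")
    if not sep:
        # assumption: a row without '##' is a sentence that has no tag
        return "", line
    return tag, sentence


rows = [
    "##repost from january 13 , 2004 with a better fit title .",
    "dvd player[ 1][p]##it practically plays almost everything you give it . ",
    "a row without a separator",
]
for row in rows:
    print(split_tag_sentence(row))
```

Because partition splits at the first ## only, any later ## in a row stays inside the sentence.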

OLD ANSWER

I would use a list of dict items to hold the data, but you can easily adjust which variables the resulting data ends up in.

from pprint import pprint


with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # splitlines() copes with both \n and \r\n line endings

result = []
current_item = None
for line in data:
    if line.startswith('[t]'):
        # a new title starts: store the item collected so far
        # (the check is needed because current_item is None on the first title)
        if current_item:
            result.append(current_item)
        current_item = {
            "title": line[4:],    # strip the "[t] " prefix
            "contents": []        # reset the contents list
        }
    elif current_item is not None:    # ignore any lines before the first title
        # partition() splits at the first "##" and does not raise if it is missing
        tag, _, sentence = line.partition("##")
        current_item["contents"].append({"tag": tag, "sentence": sentence})
# finally, add the last item (if the file contained a title at all)
if current_item:
    result.append(current_item)


# usage:
for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: sentence_only")
    for content in item["contents"]:
        print(content["sentence"])

for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: tag")
    for content in item["contents"]:
        print(content["tag"])

# pprint:
pprint(result)

Output below.
Note that I just duplicated the example input and added the very imaginative NR2 to the lines to differentiate between the two "items" in the source file...

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
Variable: sentence_only
repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button !
it practically plays almost everything you give it .
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .     

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: sentence_only
repost from january 13 , 2004 with a better fit title . NR2
im a more happier person after discovering the i/p button !  NR2
it practically plays almost everything you give it .  NR2
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .  NR2

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
Variable: tag

i/p button[ 2]
dvd player[ 1][p]
player[ 2],sound[-1]

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: tag

i/p button[ 2] NR2
dvd player[ 1][p] NR2
player[ 2],sound[-1] NR2
[{'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title .',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button ! ',
                'tag': 'i/p button[ 2]'},
               {'sentence': 'it practically plays almost everything you give '
                            'it . ',
                'tag': 'dvd player[ 1][p]'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor . ',
                'tag': 'player[ 2],sound[-1]'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '},
 {'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title . NR2',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button !  NR2',
                'tag': 'i/p button[ 2] NR2'},
               {'sentence': 'it practically plays almost everything you give '
                            'it .  NR2',
                'tag': 'dvd player[ 1][p] NR2'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor .  NR2',
                'tag': 'player[ 2],sound[-1] NR2'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '
           'NR2'}]
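
If three parallel lists are preferred over nested dicts, as in the expected output of the question, the same kind of loop can append to lists. The loop in the question assigns a new string to title, tag and sentence_only on every row, which is why only the last row survives. Below is a minimal sketch under a few assumptions: rows starting with * are comments (the question's code skips them), blank rows and rows without ## carry no data, and the name parse_rows and the plural list names are made up here.

```python
def parse_rows(rows):
    """Return three parallel lists (titles, tags, sentences), one entry per sentence row."""
    titles, tags, sentences = [], [], []
    current_title = ""  # carried forward until the next [t] row
    for row in rows:
        if not row.strip() or row.startswith("*"):
            continue  # assumption: blank rows and '*' rows carry no data
        if row.startswith("[t]"):
            current_title = row[3:].lstrip()  # handles both "[t] x" and "[t]x"
            continue
        tag, sep, sentence = row.partition("##")
        if not sep:
            continue  # assumption: rows without '##' are ignored
        # append instead of overwrite, so every row is kept
        titles.append(current_title)
        tags.append(tag)
        sentences.append(sentence)
    return titles, tags, sentences


sample = [
    "[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . ",
    "##repost from january 13 , 2004 with a better fit title .",
    "i/p button[ 2]##im a more happier person after discovering the i/p button ! ",
    "dvd player[ 1][p]##it practically plays almost everything you give it . ",
]
titles, tags, sentences = parse_rows(sample)
for title, tag, sentence in zip(titles, tags, sentences):
    print(repr(tag), "|", sentence)
```

The three lists always have the same length, so index i in each of them refers to the same input row, and the title is repeated for every sentence until a new [t] row appears.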