I am trying to write a loop in Python that extracts information from one sentence per row. The input sentences look like this:
[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
##repost from january 13 , 2004 with a better fit title .
i/p button[ 2]##im a more happier person after discovering the i/p button !
dvd player[ 1][p]##it practically plays almost everything you give it .
player[ 2],sound[-1]##i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .
I would like to extract only the sentences, use the information before the ## as tags, and write all of this to variables that then contain all the information. The expected output:
Variable: title
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
So the title should be kept until a new [t] appears in a row.
Variable: sentence_only
repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button !
it practically plays almost everything you give it .
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .
Variable: tag
i/p button[ 2]
dvd player[ 1][p]
player[ 2],sound[-1]
The current output only keeps the last row in the variable, not the full list.
Here is my attempt at solving this:
import re
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = "Data/Customer_review_data"
filelists = PlaintextCorpusReader(corpus_root, '.*')
filelists.fileids()
rawlist = filelists.raw('Apex AD2600 Progressive-scan DVD player.txt')
sentence = rawlist.split("\n")[:]
a_line = ""
sentence_only = ""
content = ""
title = ""
tag = ""
for b_line in sentence:
    if title != '' or content != '' or sentence_only != '':
        content = title, tag, sentence_only
    if re.match(r"^\*", b_line):
        continue
    if re.match(r"^\[t\][ ]", b_line):
        title = b_line[4:]
        continue
    if re.match(r"^\[t\]", b_line):
        title = b_line[3:]
        continue
    if re.match(r"^##", b_line):
        sentence_only = b_line[2:]
        continue
    if re.match(r".*##", b_line):
        i = len(b_line.split('##')[0]) + 2
        sentence_only = b_line[i:]
        tag = b_line[:i-2]
        continue
    if re.match(r".*#", b_line):
        sentence_only = b_line[2:]
        continue
print(content)
CodePudding user response:
Actually, I reread your question, and it seems each file only contains one item. If that is the case, you can do this much more easily.
with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # this seems to work OS agnostic
item = {
    "title": data[0][4:],
    "contents": [{"tag": line.split("##")[0], "sentence": line.split("##")[1]} for line in data[1:]]
}
This will result in a dict item that is the same as the ones in the old answer below...
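Note that line.split("##")[1] raises an IndexError on any line without ##. Here is a minimal sketch of a more forgiving variant, assuming you want to skip blank lines and lines starting with * (as the attempt in the question does) and to treat a line without ## as a sentence with an empty tag. The helper parse_line and the extra * line in the sample data are made up for illustration.

```python
def parse_line(line):
    """Split one review line into (tag, sentence) at the first '##'."""
    tag, sep, sentence = line.partition("##")
    if not sep:
        # Assumption: a line without '##' is a sentence with an empty tag.
        return "", line.strip()
    return tag.strip(), sentence.strip()


# Sample lines copied from the question, plus a hypothetical '*' comment line.
data = [
    "[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .",
    "* hypothetical comment line",
    "##repost from january 13 , 2004 with a better fit title .",
    "i/p button[ 2]##im a more happier person after discovering the i/p button !",
    "dvd player[ 1][p]##it practically plays almost everything you give it .",
]

item = {
    # Drop '[t]' whether or not it is followed by a space.
    "title": data[0][3:].strip(),
    "contents": [
        dict(zip(("tag", "sentence"), parse_line(line)))
        for line in data[1:]
        if line.strip() and not line.startswith("*")
    ],
}

print(item)
```

str.partition always returns three parts, so a missing separator can be detected through the empty middle element instead of through an exception.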
OLD ANSWER
I would use a list of dict items to contain the data, but you can easily adjust what variables you put the resulting data in.
from pprint import pprint
with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # this seems to work OS agnostic
result = []
current_item = None
for line in data:
    if line.startswith('[t]'):
        # add everything stored so far to result
        # check is needed for the first loop
        if current_item:
            result.append(current_item)
        current_item = {
            "title": line[4:],  # strip the [t] part
            "contents": []  # reset the contents list
        }
    else:
        current_item["contents"].append({
            "tag": line.split("##")[0],  # the first element of the split
            "sentence": line.split("##")[1]  # the second element of the split
        })
# finally, add last item
result.append(current_item)
# usage:
for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: sentence_only")
    for content in item["contents"]:
        print(content["sentence"])
for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: tag")
    for content in item["contents"]:
        print(content["tag"])
# pprint:
pprint(result)
Output below.
Note that I just duplicated the example input and added the very imaginative NR2
to the lines to differentiate between the two "items" in the source file...
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
Variable: sentence_only
repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button !
it practically plays almost everything you give it .
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: sentence_only
repost from january 13 , 2004 with a better fit title . NR2
im a more happier person after discovering the i/p button ! NR2
it practically plays almost everything you give it . NR2
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor . NR2
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
Variable: tag
i/p button[ 2]
dvd player[ 1][p]
player[ 2],sound[-1]
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: tag
i/p button[ 2] NR2
dvd player[ 1][p] NR2
player[ 2],sound[-1] NR2
[{'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title .',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button ! ',
                'tag': 'i/p button[ 2]'},
               {'sentence': 'it practically plays almost everything you give '
                            'it . ',
                'tag': 'dvd player[ 1][p]'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor . ',
                'tag': 'player[ 2],sound[-1]'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '},
 {'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title . NR2',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button ! NR2',
                'tag': 'i/p button[ 2] NR2'},
               {'sentence': 'it practically plays almost everything you give '
                            'it . NR2',
                'tag': 'dvd player[ 1][p] NR2'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor . NR2',
                'tag': 'player[ 2],sound[-1] NR2'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '
           'NR2'}]
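If you would rather have the three separate variables from the question (title, tag, sentence_only) than a list of dicts, the structure above can be flattened into parallel lists. This is only a sketch: the result literal below is a shortened, hand-written copy of the pprint output, and I am assuming you want the title repeated once per sentence row, which is slightly different from the expected output in the question (that one also counts the [t] row itself).

```python
# Shortened stand-in for the `result` list built by the parsing loop.
result = [
    {
        "title": "troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .",
        "contents": [
            {"tag": "", "sentence": "repost from january 13 , 2004 with a better fit title ."},
            {"tag": "i/p button[ 2]", "sentence": "im a more happier person after discovering the i/p button !"},
            {"tag": "dvd player[ 1][p]", "sentence": "it practically plays almost everything you give it ."},
        ],
    },
]

# Three parallel lists: index i in each list describes the same input row.
title, tag, sentence_only = [], [], []
for item in result:
    for content in item["contents"]:
        # Append instead of assigning, so earlier rows are not overwritten.
        title.append(item["title"])
        tag.append(content["tag"])
        sentence_only.append(content["sentence"])

for row in zip(title, tag, sentence_only):
    print(row)
```

Appending to lists is also what fixes the "only the last row is kept" problem: assigning to a plain variable inside the loop replaces the previous value on every iteration.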