I have a fairly large file to parse, and I have already broken it into a list of strings. The next step is to turn each string into a tuple, giving a list of tuples. The first element of each tuple should be the number at the beginning of the string, and the second should be the sentence itself. That part is easy with the .split method. The catch is that I need element 0 of each tuple (the number) to be an integer, not a string. What is a clean, non-hacky way to do this? Below is some of the input from the file, followed by my attempt so far.
Input:
"0 giuliani recalled that trump initially called it a muslim ban
0 trump was seen yesterday on television in mcdonalds commercials
0 illustration newsday photo by jon naso donald trump whom a casino analyst is suing for 2 million over trumps response to the analysts dire predictions for the taj mahal in atlantic city
0 donald trump beats hillary clinton on li by just under 20000 votes
0 and an outlier a pro trump painter depicting a snake stomping president surrounded by a young family cops miners and the military"
newbfile = r'train_orig.txt'

def getlines(filename):
    with open(filename, encoding='utf8') as fn:
        fn_list = [line.rstrip() for line in fn]
    return fn_list

def makeSentences(lines):
    x = [tuple(line.split('\t')) for line in lines]
    for index, number in enumerate(x):
        number[0] = int(number[0])
    return x
Output:
TypeError: 'tuple' object does not support item assignment
CodePudding user response:
Use str.split with the maxsplit= argument:
s = """\
0 giuliani recalled that trump initially called it a muslim ban
0 trump was seen yesterday on television in mcdonalds commercials
0 illustration newsday photo by jon naso donald trump whom a casino analyst is suing for 2 million over trumps response to the analysts dire predictions for the taj mahal in atlantic city
0 donald trump beats hillary clinton on li by just under 20000 votes
0 and an outlier a pro trump painter depicting a snake stomping president surrounded by a young family cops miners and the military"""
for line in s.splitlines():
    line = line.split(maxsplit=1)
    my_tuple = int(line[0]), line[1]
    print(my_tuple)
Prints:
(0, 'giuliani recalled that trump initially called it a muslim ban')
(0, 'trump was seen yesterday on television in mcdonalds commercials')
(0, 'illustration newsday photo by jon naso donald trump whom a casino analyst is suing for 2 million over trumps response to the analysts dire predictions for the taj mahal in atlantic city')
(0, 'donald trump beats hillary clinton on li by just under 20000 votes')
(0, 'and an outlier a pro trump painter depicting a snake stomping president surrounded by a young family cops miners and the military')
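To get a list back instead of printing, the same split can be wrapped in a small function. This is only a sketch: the name parse_lines and the sample list are made up for illustration, and it assumes every non-blank line starts with an integer followed by whitespace and at least one more word.

```python
def parse_lines(lines):
    """Return a list of (int, str) tuples, one per non-blank line."""
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # With no separator given, split() breaks on any run of whitespace,
        # so this works for both tab- and space-separated input.
        number, sentence = line.split(maxsplit=1)
        result.append((int(number), sentence))
    return result


# Hypothetical sample: one space-separated line, one tab-separated line.
sample = [
    "0 giuliani recalled that trump initially called it a muslim ban",
    "0\ttrump was seen yesterday on television in mcdonalds commercials",
]
parsed = parse_lines(sample)
print(parsed)
```

A line that contains only a number would raise ValueError at the unpacking step, so add a check if your file can contain such lines.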
CodePudding user response:
You can use tuple unpacking with the result of .split:
tuples = []
with open(filename, encoding='utf8') as fn:
    for line in fn:
        split_line = line.rstrip().split('\t', maxsplit=1)
        (number, sentence) = split_line
        formatted_tuple = (int(number), sentence)
        tuples.append(formatted_tuple)
I've included the variables split_line and formatted_tuple for clarity, though they are not necessary here.
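If the file might contain lines that do not match the expected format, the unpacking approach can report which line is at fault. The following is a rough sketch, not a drop-in replacement: make_sentences, its sep parameter, and the io.StringIO stand-in for a real file are all assumptions for the sake of a runnable example.

```python
import io


def make_sentences(lines, sep='\t'):
    """Build (int, str) tuples from an iterable of 'number<sep>sentence' lines."""
    tuples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line:
            continue  # ignore empty lines
        try:
            # Unpacking fails if sep is missing; int() fails on a non-number.
            number, sentence = line.split(sep, maxsplit=1)
            tuples.append((int(number), sentence))
        except ValueError as exc:
            raise ValueError(f"line {line_no} is malformed: {line!r}") from exc
    return tuples


# io.StringIO stands in for an open file here; any iterable of lines works.
fake_file = io.StringIO("0\tfirst sentence\n1\tsecond sentence\n")
result = make_sentences(fake_file)
print(result)
```

Both a missing separator and a non-numeric first field raise ValueError, so a single except clause covers the two failure modes.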
CodePudding user response:
Tuples are immutable in Python, so number[0] = int(number[0]) cannot work; you have to build a new tuple instead.
You can use the approach below, where re is used to split the string.
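import re

def makeSentences(lines):
    # split each line on a run of two or more whitespace characters
    x = [re.split(r'\s{2,}', line) for line in lines]
    # build new tuples, converting the leading number to int
    x = [(int(number), string) for number, string in x]
    return x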
Result:
[(0, 'giuliani recalled that trump initially called it a muslim ban'),
(0, 'trump was seen yesterday on television in mcdonalds commercials'),
(0,
'illustration newsday photo by jon naso donald trump whom a casino analyst is suing for 2 million over trumps response to the analysts dire predictions for the taj mahal in atlantic city'),
(0, 'donald trump beats hillary clinton on li by just under 20000 votes'),
(0,
'and an outlier a pro trump painter depicting a snake stomping president surrounded by a young family cops miners and the military')]
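As a small illustration of why the original loop fails, the snippet below reproduces the error on a single made-up tuple and then applies the fix of building a new tuple. The variable names are only for demonstration.

```python
pair = ("0", "some sentence")

# Item assignment on a tuple raises TypeError because tuples are immutable.
try:
    pair[0] = int(pair[0])
except TypeError as exc:
    error_message = str(exc)

# The fix: create a new tuple with the converted value and rebind the name.
pair = (int(pair[0]), pair[1])

print(error_message)
print(pair)
```

Note that the pattern \s{2,} only matches two or more whitespace characters in a row, so a line separated by a single tab would come back unsplit; splitting on '\t' or using str.split(maxsplit=1) may be safer for tab-delimited input.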