I have this text file:
1e.jpg#0 A dog going for a walk .
2e.jpg#1 A boy is going to swim
3e.jpg#2 A girl is chasing the cat .
4e.jpg#3 Three people are going to a hockey game
I need to split it into two separate lists: one for the IDs and one for the sentences. Where I need help is the next step: splitting each sentence in the sentences list into words, like this:
[["a", "dog", "going", "for", "a"...], ["a",......]]
This is how far I got:
path = "s.txt"
l1 = []
l2 = []
read_file=open(path, "r")
split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    l2.append(line.split("\t")[1:])
print(l2)
CodePudding user response:
You can use the same principle. The split function splits on whitespace by default. I also removed the : from l2.append(line.split("\t")[1:]) so that it returns a string instead of a list with one element:
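path = "s.txt"
l1 = []  # IDs
l2 = []  # sentences as plain strings
# the with block closes the file automatically
with open(path, "r") as read_file:
    split = [line.strip() for line in read_file]
for line in split:
    parts = line.split("\t")
    l1.append(parts[0])
    l2.append(parts[1])  # [1] instead of [1:] gives a string, not a one-element list
words_list = []
for s in l2:
    words_list.append(s.split())  # split() with no argument splits on whitespace
print(words_list)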
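The same idea can be tried without a file on disk. This is only a sketch: split_captions is a hypothetical helper name, and it assumes each line is an ID, then a tab, then the sentence. It uses str.partition so that only the first tab counts as the separator.

```python
def split_captions(lines):
    """Return (ids, token_lists) for lines shaped like 'ID<TAB>sentence'.

    Hypothetical helper; assumes a tab separates the ID from the sentence.
    """
    ids, token_lists = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # partition splits on the first tab only
        image_id, _, sentence = line.partition("\t")
        ids.append(image_id)
        token_lists.append(sentence.split())  # whitespace split
    return ids, token_lists


# sample lines taken from the question, with an explicit tab separator
sample = [
    "1e.jpg#0\tA dog going for a walk .\n",
    "2e.jpg#1\tA boy is going to swim\n",
]
ids, token_lists = split_captions(sample)
print(ids)
print(token_lists)
```

Any iterable of lines works here, so an open file object can be passed in place of the sample list.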
CodePudding user response:
If you don't care about punctuation being added to your lists, you can just split your string in your current code (assuming only one tab character occurs):
path = "s.txt"
l1 = []
l2 = []
with open(path, "r") as read_file:
    split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    # split the sentence into words right away
    l2.append(line.split("\t")[1].split())
print(l2)
Output:
[['A', 'dog', 'going', 'for', 'a', 'walk', '.'], ['A', 'boy', 'is', 'going', 'to', 'swim'], ['A', 'girl', 'is', 'chasing', 'the', 'cat', '.'], ['Three', 'people', 'are', 'going', 'to', 'a', 'hockey', 'game']]
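The desired result in the question shows lowercase words, while this output keeps the original capitalization. If lowercase is what you want, one possible follow-up step is sketched below; token_lists and lowered are illustrative names, and the data is copied from the output above.

```python
# two of the token lists from the output above
token_lists = [
    ['A', 'dog', 'going', 'for', 'a', 'walk', '.'],
    ['A', 'boy', 'is', 'going', 'to', 'swim'],
]

# lowercase every token; punctuation tokens such as '.' pass through unchanged
lowered = [[word.lower() for word in tokens] for tokens in token_lists]
print(lowered)
```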
If you want to remove non-word elements, you can use re.split:
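import re
# a space, optionally with one non-word character on either side
split_pattern = re.compile(r'\W? \W?')
path = "s.txt"
l1 = []
l2 = []
with open(path, "r") as read_file:
    split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    # drop the empty strings left behind when a match ends the sentence
    word_list = [x for x in re.split(split_pattern, line.split("\t")[1]) if x]
    l2.append(word_list)
print(l2)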
Output:
[['A', 'dog', 'going', 'for', 'a', 'walk'], ['A', 'boy', 'is', 'going', 'to', 'swim'], ['A', 'girl', 'is', 'chasing', 'the', 'cat'], ['Three', 'people', 'are', 'going', 'to', 'a', 'hockey', 'game']]
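The re.split pattern above depends on a space sitting next to the punctuation, so a sentence ending in "walk." with no space would keep the period attached. A different approach, not the one used above, is to collect the words directly with re.findall. The sketch below also lowercases them to match the output asked for in the question; WORD_RE and words_only are hypothetical names.

```python
import re

# \w+ matches runs of letters, digits and underscores
WORD_RE = re.compile(r"\w+")


def words_only(sentence):
    """Return the lowercase words of a sentence, without punctuation."""
    return WORD_RE.findall(sentence.lower())


print(words_only("A girl is chasing the cat ."))
print(words_only("A dog going for a walk."))
```

Note that \w+ also breaks words at apostrophes and hyphens, so "don't" would come out as "don" and "t". Whether that is acceptable depends on your data.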