How to split a txt file into two lists and then split one list into the words of its captions

Time:12-02

I have this text file, where each line is an ID followed by a tab and a sentence:

1e.jpg#0   A dog going for a walk .
2e.jpg#1   A boy is going to swim 
3e.jpg#2   A girl is chasing the cat .
4e.jpg#3   Three people are going to a hockey game

I need to split it into two separate lists: one for the IDs and one for the sentences. That part works. Where I need help is splitting each sentence in the second list into its words, like this:

[["a", "dog", "going", "for", "a"...], ["a",......]] 

This is how far I got:

path = "s.txt"

l1 = []
l2 = []
read_file=open(path, "r")
split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    l2.append(line.split("\t")[1:])
    
print(l2)

CodePudding user response:

You can use the same principle: called with no arguments, split() splits on whitespace. I also changed [1:] to [1] in l2.append(line.split("\t")[1:]) so that each entry is a string instead of a list with one element:

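path = "s.txt"

l1 = []  # image IDs
l2 = []  # caption strings
# The with block closes the file once reading is done
with open(path, "r") as read_file:
    lines = [line.strip() for line in read_file]
for line in lines:
    fields = line.split("\t")
    l1.append(fields[0])
    l2.append(fields[1])  # index, not slice: a string, not a list

# Split each caption on whitespace into its words
words_list = []
for s in l2:
    words_list.append(s.split())

print(words_list)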
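
To make the difference between [1:] and [1] concrete, here is a small sketch on a single line. The sample string is an assumption: it is the first line of the file with a real tab between the ID and the caption.

```python
# Assumed sample line in the format "<id>\t<caption>"
line = "1e.jpg#0\tA dog going for a walk ."

fields = line.split("\t")   # split on the tab only
as_list = fields[1:]        # slice: a list holding the caption string
as_string = fields[1]       # index: the caption string itself
words = as_string.split()   # no argument: split on runs of whitespace

print(as_list)
print(as_string)
print(words)
```

The slice keeps the caption wrapped in a list, which is why the original print(l2) showed a list of one-element lists.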

CodePudding user response:

If you don't care about punctuation ending up in your lists, you can just split the sentence in your current code (assuming each line contains only one tab character):

path = "s.txt"

l1 = []
l2 = []
with open(path, "r") as read_file:
    lines = [line.strip() for line in read_file]
for line in lines:
    fields = line.split("\t")
    l1.append(fields[0])
    # Split the caption on whitespace right away
    l2.append(fields[1].split())

print(l2)

Output:

[['A', 'dog', 'going', 'for', 'a', 'walk', '.'], ['A', 'boy', 'is', 'going', 'to', 'swim'], ['A', 'girl', 'is', 'chasing', 'the', 'cat', '.'], ['Three', 'people', 'are', 'going', 'to', 'a', 'hockey', 'game']]
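
If the one-tab assumption might not hold, one possible variation is str.partition, which splits at the first tab only. This is a sketch, not part of the answer above: the in-memory lines list stands in for the file, and the second caption with an embedded tab is made up for illustration.

```python
# In-memory stand-in for the file contents (assumed tab-separated)
lines = [
    "1e.jpg#0\tA dog going for a walk .",
    "2e.jpg#1\tA boy is going\tto swim",  # hypothetical caption containing a tab
]

ids = []
captions = []
for line in lines:
    # partition splits at the first tab only and always returns three parts
    image_id, _, caption = line.strip().partition("\t")
    ids.append(image_id)
    # split() treats the stray tab like any other whitespace
    captions.append(caption.split())

print(ids)
print(captions)
```

With line.split("\t")[1], the second caption would have been cut off at "going"; partition keeps everything after the first tab.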

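If you want to remove non-word elements, you can use re.split. Note that this pattern only drops a single non-word character that sits directly before or after a space, so punctuation attached to the last word of a line (such as "cat.") would stay: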

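import re

path = "s.txt"
# Matches a space with an optional non-word character on either side
split_pattern = re.compile(r'\W? \W?')

l1 = []
l2 = []
with open(path, "r") as read_file:
    lines = [line.strip() for line in read_file]
for line in lines:
    fields = line.split("\t")
    l1.append(fields[0])
    # Drop the empty strings that the split leaves behind
    word_list = [x for x in split_pattern.split(fields[1]) if x]
    l2.append(word_list)

print(l2)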

Output:

[['A', 'dog', 'going', 'for', 'a', 'walk'], ['A', 'boy', 'is', 'going', 'to', 'swim'], ['A', 'girl', 'is', 'chasing', 'the', 'cat'], ['Three', 'people', 'are', 'going', 'to', 'a', 'hockey', 'game']]
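
The target in the question is lowercase, which none of the snippets above produce. As a sketch of one way to get there, the version below lowercases each caption and collects words with re.findall instead of splitting. The in-memory lines list and the caption with punctuation attached to a word are assumptions for illustration.

```python
import re

# In-memory stand-in for the file contents (assumed tab-separated)
lines = [
    "1e.jpg#0\tA dog going for a walk .",
    "3e.jpg#2\tA girl is chasing the cat.",  # hypothetical: punctuation attached to a word
]

ids = []
tokens = []
for line in lines:
    image_id, _, caption = line.strip().partition("\t")
    ids.append(image_id)
    # \w+ matches runs of letters, digits and underscores, so punctuation is skipped
    tokens.append(re.findall(r"\w+", caption.lower()))

print(ids)
print(tokens)
```

One caveat with this approach: \w+ also breaks words at apostrophes and hyphens, so "don't" would come out as "don" and "t". If that matters for your captions, the pattern would need adjusting.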