I have a list of values and I need to search through an ~1 GB text file for those values and then pull a corresponding value from the text file in the same row. Is there a more efficient method than using nested for loops where I would be going through my list of values and then going into my nested loop where I would be iterating through the text file to find the corresponding matches?
for name in name_list:
with open("test.txt") as infile:
for line in infile:
currentline = line.split("|")
if name == currentline[0]:
print(currentline[2])
CodePudding user response:
This is exactly what a dictionary is for. Assuming parts[0]
is unique:
dic = {}
with open("test.txt") as infile:
for line in infile:
parts = line.split("|")
dic[parts[0]] = parts[2]
for name in name_list:
if name in dic:
print(dic[name])
if it isn't unique, you'll have to keep a list for each dict entry:
dic = {}
with open("test.txt") as infile:
for line in infile:
parts = line.split("|")
if parts[0] in dic:
dic[parts[0]].append(parts[2])
else:
dic[parts[0]] = [parts[2]]
for name in name_list:
if name in dic:
for matching in dic[name]:
print(matching)
this drastically reduces the runtime: assuming your file has n
entries and your name_list
has m
entries, you had a complexity of O(n * m)
before - now you have O(n m), since hash map access is constant time - you don't perform a linear search anymore!
In practice, the speedup will be even larger since your previous code reread the file within the O(m) loop.
Furthermore, as suggested by @KellyBundy, you can use a set
with the name_list
to only collect values for names you're interested in:
dic = {name: [] for name in name_list}
with open("test.txt") as infile:
for line in infile:
parts = line.split("|")
if parts[0] in dic:
dic[parts[0]].append(parts[2])
for matching_list in dic.values():
for matching in matching_list:
print(matching)
CodePudding user response:
The the loop through the name_list is faster than that one in the files. So you can invert loops and using set to check the value:
names_to_search_for = set(name_list)
with open("test.txt") as infile:
for line in infile:
currentline = line.split("|")
if currentline[0] in names_to_search_for:
print(currentline[0], currentline[2])
This should reduce the access to the file and the speed.
CodePudding user response:
You can put name in a set, where lookups are quick. Then as you scan the lines, you can check if your line part is in names:
names = set(name_list)
with open("test.txt") as infile:
for line in infile:
split_line = line.split("|")
if split_line[0] in names:
print(split_line[2])