Most efficient way to loop through a file's values and check a dictionary for any/all corresponding values


I have a file with users' names, one per line, and I need to compare each name in the file to all values in a CSV file and note each time the name appears there. I need to make the search as efficient as possible, as the CSV file is 40K lines long.

My example persons.txt file:

Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge

My example locations.csv file:

GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NorthernHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemisphere,Australia,Mechanics,"Freeman, Gordon",Yes
NorthernHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemisphere,New Zealand,Mechanics,"Samson, David",Yes

My Code:

import csv

def parse_files():

    with open('data_file/persons.txt', 'r') as user_list:
        lines = user_list.readlines()
        for user_row in lines:
            new_user = user_row.strip()
            per = []
            with open('data_file/locations.csv', newline='') as target_csv:
                DictReader_loc = csv.DictReader(target_csv)
            
                for loc_row in DictReader_loc:
                    if new_user.lower() in loc_row['DisplayName'].lower():
                        per.append(DictReader_loc.line_num)
                        print(DictReader_loc.line_num, loc_row['DisplayName'])
            if len(per) > 0:
                print("\n" new_user, per)
    print("Parse Complete")
        
def main():
    parse_files()

main()

My code currently works. Based on the sample data in the example files, the code matches the 2 instances of "Samson, David" and 1 instance of "Simpson, Marge" in the locations.csv file. I'm hoping that someone can give me guidance on how I might transform either the persons.txt file or the locations.csv file (40K lines) so that the process is as efficient as it can be. I think it currently takes 10-15 minutes. I know looping isn't the most efficient, but I do need to check each name and see where it appears in the csv file.

CodePudding user response:

A bit of a creative solution, but it has the advantage of being very expressive (as opposed to code that loops and slices up strings manually), versatile, and fast (SQLite is pretty darn fast).

Loading everything into an SQLite database:

import csv
import sqlite3

db = sqlite3.connect(':memory:')

db.executescript('''
CREATE TABLE person (name TEXT PRIMARY KEY);
CREATE TABLE location (GreaterLocation TEXT, LesserLocation TEXT, GroupName TEXT, DisplayName TEXT, InBook TEXT);
''')

with open('data_file/locations.csv', newline='') as loc_csv:
    db.executemany(
        'INSERT INTO location VALUES (:GreaterLocation, :LesserLocation, :GroupName, :DisplayName, :InBook);',
        csv.DictReader(loc_csv))

with open('data_file/persons.txt') as persons:
    db.executemany('INSERT INTO person VALUES (?);', ([p.strip()] for p in persons))

Now we can easily query where each person appears, by comparing the start of the DisplayName to the person's name:

matches = db.execute('''
    SELECT
        p.Name,
        COUNT(l.LesserLocation) NumLocations,
        GROUP_CONCAT(l.LesserLocation, '; ') Locations
    FROM
        person p
        LEFT JOIN location l ON l.DisplayName LIKE p.Name || '%'
    GROUP BY
        p.Name
''')

for row in matches:
    print(row)

This prints:

('Martin, Patricia', 0, None)
('Samson, David', 2, 'Japan; New Zealand')
('Simpson, Marge', 1, 'Mexico')
('Smith, Robert', 0, None)
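
The cursor rows are plain tuples, so if you'd rather capture the result in a file than print it, you can feed the same cursor straight into csv.writer in place of the print loop (the matches.csv file name is just an example):

import csv

with open('matches.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Name', 'NumLocations', 'Locations'])
    writer.writerows(matches)  # consumes the cursor, one row per person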

CodePudding user response:

I think @Tomalak's solution with SQLite is very useful, but if you want to keep it closer to your original code, see the version below.

Effectively, it reduces the amount of file opening, closing, and re-reading going on (your original code re-reads the whole 40K-line CSV once per user), which should speed things up noticeably.

Since your sample is very small, I could not do any real measurements.

Going forward, you can consider using pandas for these kinds of tasks - it makes working with CSVs very convenient and is more optimized than the csv module (see the sketch at the end of this answer).

import csv

def parse_files():
    with open('persons.txt', 'r') as user_list:
        # a set de-duplicates the names, and doing the lower() here
        # avoids repeating it for every CSV row
        user_set = {u.strip().lower() for u in user_list}

    # read the CSV once and keep the (line number, DisplayName) pairs
    # in memory; a DictReader is exhausted after a single pass, so it
    # can't be re-used for every user
    with open('locations.csv', newline='') as target_csv:
        DictReader_loc = csv.DictReader(target_csv)
        rows = [(DictReader_loc.line_num, loc_row['DisplayName'])
                for loc_row in DictReader_loc]

    for user in user_set:
        per = []
        for line_num, display_name in rows:
            if user in display_name.lower():
                per.append(line_num)
                print(line_num, display_name)
        if per:
            print("\n", user, per)
    print("Parse Complete")

def main():
    parse_files()

main()
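
For reference, here is a minimal sketch of the pandas approach mentioned above. It assumes the same two input files; str.contains does the same case-insensitive substring match as the original code, and the line-number arithmetic (+2 for the header row and 1-based numbering) assumes no quoted multi-line fields in the CSV:

import pandas as pd

# read the whole 40K-line CSV into memory in one call
locations = pd.read_csv('locations.csv')
display_lower = locations['DisplayName'].str.lower()

with open('persons.txt') as f:
    users = {line.strip().lower() for line in f}

for user in sorted(users):
    # boolean mask of the rows whose DisplayName contains this name
    mask = display_lower.str.contains(user, regex=False)
    if mask.any():
        # convert the 0-based row index to a physical line number
        print(user, list(locations.index[mask] + 2))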