Is there a better way to do this? Counting files and directories via for loop vs map


Folks,

I'm trying to optimize this to help speed up the process...

What I am doing is creating a dictionary of scandir entries... e.g.

fs_data = {}
for item in Path(fqpn).iterdir():
    # snipped out a bunch of normalization code
    fs_data[item.name.title().strip()] = item

{'file1': <file1 scandir entry>, etc}

and then later using a function to count the files and directories in that data.

Now I suspect that the new code, using map, could be optimized to be faster than the old code. My suspicion is that running the generator expression over the mapped data twice, once for files and once for directories, is what's slowing it down.

But I can't think of a way to make it run only once.

Can anyone suggest a way to sum the files and directories at the same time in the new version? (I could fall back to the old code, if necessary.)

But maybe I'm over-optimizing at this point?

Any feedback would be welcome.

def new_fs_counts(fs_entries) -> tuple[int, int]:
    """
    Quickly count the files vs directories in a dict of scandir entries.
    Used primarily by sync_database_disk to count a path's files & directories.

    Parameters
    ----------
    fs_entries (dict) - dictionary of scandir entries, keyed by name

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    def counter(fs_entry):
        # True/False pairs sum cleanly: True counts as 1, False as 0
        return (fs_entry.is_file(), not fs_entry.is_file())

    mapdata = list(map(counter, fs_entries.values()))
    files = sum(files for files, _ in mapdata)
    dirs = sum(dirs for _, dirs in mapdata)
    return (files, dirs)

vs

def old_fs_counts(fs_entries) -> tuple[int, int]:
    """
    Quickly count the files vs directories in a dict of scandir entries.
    Used primarily by sync_database_disk to count a path's files & directories.

    Parameters
    ----------
    fs_entries (dict) - dictionary of scandir entries, keyed by name

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    files = 0
    dirs = 0
    for fs_item in fs_entries:
        is_file = fs_entries[fs_item].is_file()
        files += is_file       # booleans add as 0 or 1
        dirs += not is_file
    return (files, dirs)
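For context, here's a minimal usage sketch tying the pieces together (the path assigned to fqpn is a placeholder; this just mirrors the dict-building code from the top of the question):

from pathlib import Path

fqpn = "/some/directory"  # placeholder path

fs_data = {}
for item in Path(fqpn).iterdir():
    fs_data[item.name.title().strip()] = item

# Both versions take the dict and return (files, dirs)
print(new_fs_counts(fs_data))
print(old_fs_counts(fs_data))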

CodePudding user response:

Okay, map is definitely not the right answer here.

This morning I got up and created a test using timeit... and it was a bit of a splash of reality to the face.

Without optimizations, new vs old, the new map code took roughly 2x the time.

New : 0.023185124970041215 old : 0.011841499945148826

I really ended up falling for a bit of clickbait, thinking that rewriting with map would gain some efficiency.

For the sake of completeness.

from timeit import timeit
import os


new = '''
def counter(fs_entry):
    files = fs_entry.is_file()

    return (files, not files)

mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
#dirs = len(fs_entries)-files
'''

old = '''
files = 0
dirs = 0
for fs_item in fs_entries:
    is_file = fs_entries[fs_item].is_file()
    files += is_file
    dirs += not is_file
'''

fs_location = '/Volumes/4TB_Drive/gallery/albums/collection1'
fs_data = {}
for item in os.scandir(fs_location):
    fs_data[item.name] = item

print("New : ", timeit(stmt=new, number=1000, globals={'fs_entries':fs_data}))
print("old : ", timeit(stmt=old, number=1000, globals={'fs_entries':fs_data}))

And while I was able to close the gap with some optimizations... (thank you, Lee, for your suggestion)

New : 0.10864979098550975 old : 0.08246175001841038

It is clear that the for loop solution is easier to read, faster, and just simpler.

The speed difference between new and old doesn't seem to come from map specifically.

The duplicate sum statement added roughly 0.021, and the biggest slowdown came from the second fs_entry.is_file() call, which added roughly 0.06 to the timings...
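For what it's worth, the optimization hinted at by the commented-out line in the harness above (and, presumably, Lee's suggestion) avoids both the second is_file() call and the duplicate sum by deriving the directory count from the total. A minimal sketch, assuming (as the original code does) that everything that isn't a file counts as a directory:

def fs_counts(fs_entries) -> tuple[int, int]:
    # One is_file() call per entry; True sums as 1, False as 0
    files = sum(entry.is_file() for entry in fs_entries.values())
    # Anything that isn't a file is treated as a directory
    dirs = len(fs_entries) - files
    return (files, dirs)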

CodePudding user response:

To answer your question about how to run over the mapped data only once, an alternative solution is this...

def my_fs_counts(fs_entries) -> tuple[int, int]:
    def counter(fs_entry):
        return fs_entry.is_file()

    files = sum(map(counter, fs_entries.values()))
    dirs = len(fs_entries) - files
    return (files, dirs)

But in all honesty, the original old_fs_counts is still the better option.

CodePudding user response:

map is good here if you just do it right:

files = sum(map(os.DirEntry.is_file, fs_entries.values()))
dirs = len(fs_entries) - files

(Something with filter might be even faster, at least if most entries aren't files. Or filter with is_dir if that works for you and most entries aren't directories. Or itertools.filterfalse with is_file. Or using itertools.compress. Also, counting True with list.count or operator.countOf instead of summing bools might be faster. But all of these ideas take more code (and some also memory). I'd prefer my above way.)
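For illustration, here are rough sketches of those alternatives (assuming fs_entries is the dict of os.scandir entries from the question; these are untested for speed):

import operator
import os
from itertools import compress, filterfalse

values = list(fs_entries.values())
checks = list(map(os.DirEntry.is_file, values))  # one is_file() call per entry

# filter: keep only the files, then count them
files = sum(1 for _ in filter(os.DirEntry.is_file, values))

# itertools.filterfalse: keep only the non-files (the directories here)
dirs = sum(1 for _ in filterfalse(os.DirEntry.is_file, values))

# itertools.compress: select the entries whose precomputed bool is True
files = sum(1 for _ in compress(values, checks))

# list.count / operator.countOf: count True instead of summing bools
files = checks.count(True)
files = operator.countOf(checks, True)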
