Sort and unique directory listing with python [closed]

How do I sort these files by the unique "\d{8}" date string (%Y%m%d) in their file names, then write a for loop that does something with all the files created on each day that has one or more files? I can do this in the shell, but it is very slow, so I'm now working in Python.

Sample file list. (There are 2200 files total)

Tyler Cowen On Reading 202109200657.md
On Poems 202109210659.md
Slava Akhmechet On Reading In Clusters 202109200659.md
Ideation In A 4X4 Matrix 202109200717.md
Drawing Grid Ideation 202109220830.md
Dictation 201208251425.md

The output would look like this, for eventual graphing with plotly.

20120825,1
20210920,3
20210921,1
20210922,1

Files 1, 3 and 4 would be grouped together. I want to be able to do other things with each day's files, such as getting a total word count.

CodePudding user response：

If you're trying to replace a shell script, your Python script will probably need to do the following.

List the contents of a directory to get the filenames.
Extract the date from the filenames (assuming a regular expression pattern match of \d{8} is good enough to extract the date).
Sort or otherwise group the files by the extracted date.
Iterate over those groups to do something.

import pathlib
import re
from collections import defaultdict

date_pattern = re.compile(r"\d{8}")
target_dir = pathlib.Path("myfolder")

# Files is a dictionary mapping a date to the list of files with that date
files = defaultdict(list)
for child in target_dir.iterdir():
    # Skip directories
    if child.is_dir():
        continue
    match = date_pattern.search(child.name)
    # Skip files that do not match the date pattern
    if match is None:
        continue
    file_date = match.group()
    files[file_date].append(child)

# Walk the dates in ascending order (YYYYMMDD strings sort chronologically)
for date, names in sorted(files.items()):
    for filename in names:
        # Do something
        print(date, filename)
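
The question also asks for a per-day file count and a total word count. Building on the grouping above, a minimal sketch might look like the following. The function names group_by_date and daily_summary are made up for this illustration, and the sketch assumes the files are UTF-8 text and that "word" simply means a whitespace-separated token.

```python
import re
from collections import defaultdict
from pathlib import Path

DATE_PATTERN = re.compile(r"\d{8}")


def group_by_date(directory):
    """Map each 8-digit date string to the list of files whose name contains it."""
    groups = defaultdict(list)
    for child in Path(directory).iterdir():
        match = DATE_PATTERN.search(child.name)
        # Skip directories and names without a date
        if child.is_file() and match:
            groups[match.group()].append(child)
    return groups


def daily_summary(directory):
    """Return (date, file_count, word_count) tuples in ascending date order."""
    summary = []
    for date, paths in sorted(group_by_date(directory).items()):
        # Assumption: files are UTF-8 text; words are whitespace-separated
        words = sum(len(p.read_text(encoding="utf-8").split()) for p in paths)
        summary.append((date, len(paths), words))
    return summary
```

Calling daily_summary("myfolder") and printing each tuple joined by commas would give lines in the "date,count" shape from the question, with the word total as a third column.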

CodePudding user response：

Is this what you need? The code below extracts the date from each file name and appends the file to a dictionary keyed by date, so your dictionary will have the format:

{
date1: [list of files],
date2: [list of files]
}

Here is the code:

from collections import defaultdict
import re
files = ['Tyler Cowen On Reading 202109200657.md',
'On Poems 202109210659.md',
'Slava Akhmechet On Reading In Clusters 202109200659.md',
'Ideation In A 4X4 Matrix 202109200717.md',
'Drawing Grid Ideation 202109220830.md',
'Dictation 201208251425.md']

out = defaultdict(list)
for file in files:
    # Capture the run of digits just before the ".md" extension
    date = re.search(r'.*\s(\d+)\.md', file)
    if date:
        date = date.group(1)[:8]
        out[date].append(file)
print(out)

Output:

defaultdict(<class 'list'>, {'20210920': ['Tyler Cowen On Reading 202109200657.md', 'Slava Akhmechet On Reading In Clusters 202109200659.md', 'Ideation In A 4X4 Matrix 202109200717.md'], '20210921': ['On Poems 202109210659.md'], '20210922': ['Drawing Grid Ideation 202109220830.md'], '20120825': ['Dictation 201208251425.md']})

Please note that this code only shows the logic and does not read the list of files from the directory; you will have to build a list of the required files yourself and use that list in the code above.
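
For completeness, here is one way the file list could be read from a directory instead of being typed in by hand. This is only a sketch: files_by_date is a hypothetical helper name, and it assumes every file of interest ends in a run of digits followed by ".md".

```python
import os
import re
from collections import defaultdict

# A run of digits right before the ".md" extension
STAMP = re.compile(r"(\d+)\.md$")


def files_by_date(directory="."):
    """Group the .md filenames in `directory` by the first 8 digits of their stamp."""
    out = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        match = STAMP.search(name)
        if match:
            # Keep only the YYYYMMDD part of the timestamp
            out[match.group(1)[:8]].append(name)
    return out
```

Names that do not end in digits plus ".md" are silently skipped, which matches the behaviour of the `if date:` check in the code above.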

CodePudding user response：

There's quite a lot of stuff in this question and I would be surprised if this was set as part of a programming course. Here's the task list as I understand it:

extract timestamps from filenames: string and list handling
normalise timestamps into dates: date handling
sort by number of documents per day descending, then, within the same number of documents per day, by date ascending: stable sorts
group documents on the same date to process them in some way: passing functions to other functions

I would highly recommend the arrow library for manipulating dates. To begin with, install arrow:

pip install arrow

then I suggest the following:

import itertools
from collections import Counter
from pathlib import Path

import arrow

docs = [
    'Tyler Cowen On Reading 202109200657.md',
    'On Poems 202109210659.md',
    'Slava Akhmechet On Reading In Clusters 202109200659.md',
    'Ideation In A 4X4 Matrix 202109200717.md',
    'Drawing Grid Ideation 202109220830.md',
    'Dictation 201208251425.md',
]


def datestamp(filename):
    basename = Path(filename).stem
    date_as_string = basename.split()[-1]
    # HH is arrow's 24-hour token; the filenames use 24-hour times such as 1425
    timestamp = arrow.get(date_as_string, 'YYYYMMDDHHmm')
    return timestamp.format('YYYYMMDD')

To extract the date part from a filename, all you need here is the last part of the document's "base name", after the last space. Python's .split() method does the trick. [-1] extracts the last item in a list, so basename.split()[-1] gets everything in the file's basename after the last space.

arrow is used to parse the timestamp and reformat it as a whole date.
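
If installing a third-party package is not an option, the same normalisation can probably be done with the standard library's datetime module. The sketch below is an alternative to the arrow version, not a replacement for it; datestamp_stdlib is a name invented for this example, and it assumes the timestamp is always the last space-separated part of the base name in YYYYMMDDHHMM form.

```python
from datetime import datetime
from pathlib import Path


def datestamp_stdlib(filename):
    """Return the YYYYMMDD part of a trailing YYYYMMDDHHMM timestamp."""
    stamp = Path(filename).stem.split()[-1]
    # strptime raises ValueError if the text is not a valid timestamp
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M")
    return parsed.strftime("%Y%m%d")
```

Parsing rather than slicing the first eight characters has the side benefit of rejecting impossible dates, since strptime raises ValueError for them.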

datestamps = [datestamp(doc) for doc in docs]
datestamps.sort()
docDates = Counter()
for date in datestamps:
    docDates[date] += 1
for date, doc_count in docDates.most_common():
    print(f'{date},{doc_count}')

Counter() is a useful collection from the Python standard library. Its .most_common() method is used to sort dates with the most docs first:

$ python docs.py
20210920,3
20120825,1
20210921,1
20210922,1

Note that the 1-doc dates are second-level sorted by date. The earlier .sort() (before .most_common()) acts as a second-level sort by date because Python's sort is always "stable", and iterating over a Counter() follows the original insertion order.

To understand stable sorting better, visit this link. It may take you a few goes to understand it.
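
The stable-sort behaviour described above can be seen in a small standalone experiment. The sketch below uses hard-coded datestamps matching the sample files and compares the two-pass stable approach with a single sort on an explicit compound key; both are illustrations rather than the only way to do it.

```python
from collections import Counter

datestamps = ["20210920", "20210921", "20210920", "20210920", "20210922", "20120825"]

# Pass 1: sort by date, so the Counter's keys are inserted in ascending date order
counts = Counter(sorted(datestamps))

# Pass 2: sort by count, descending. Because the sort is stable, dates with
# equal counts keep the ascending date order from pass 1.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Same result without relying on stability: one sort on (-count, date)
explicit = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(by_count)
# [('20210920', 3), ('20120825', 1), ('20210921', 1), ('20210922', 1)]
```

The compound-key version is arguably easier to read later, because the tie-breaking rule is written down in the key rather than implied by an earlier sort.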

To group documents sharing a given date, first sort the documents, then group them by the same key function:

docs.sort(key=datestamp)
for date, docs_on_date in itertools.groupby(docs, key=datestamp):
    print(date, list(docs_on_date))

This again is quite sophisticated: you're passing a function to the sort method and the groupby method.

Results:

20120825 ['Dictation 201208251425.md']
20210920 ['Tyler Cowen On Reading 202109200657.md', 'Slava Akhmechet On Reading In Clusters 202109200659.md', 'Ideation In A 4X4 Matrix 202109200717.md']
20210921 ['On Poems 202109210659.md']
20210922 ['Drawing Grid Ideation 202109220830.md']
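
To "do other stuff" with each day's files, the same sort-then-group pattern can be wrapped in a function that takes the per-day action as a parameter. This is a sketch only: process_by_day and day_of are hypothetical names, and day_of is a simplified stand-in for datestamp that just slices the first eight characters after the last space.

```python
import itertools


def day_of(filename):
    """Simplified key: first 8 characters after the last space in the name."""
    return filename.rsplit(" ", 1)[-1][:8]


def process_by_day(docs, key, action):
    """Call action(date, docs_for_date) once per date and collect the results."""
    results = {}
    # groupby only merges adjacent items, so the input must be sorted by the same key
    for date, group in itertools.groupby(sorted(docs, key=key), key=key):
        results[date] = action(date, list(group))
    return results


docs = [
    'Tyler Cowen On Reading 202109200657.md',
    'On Poems 202109210659.md',
    'Slava Akhmechet On Reading In Clusters 202109200659.md',
    'Ideation In A 4X4 Matrix 202109200717.md',
    'Drawing Grid Ideation 202109220830.md',
    'Dictation 201208251425.md',
]

counts = process_by_day(docs, key=day_of, action=lambda date, names: len(names))
print(counts)
# {'20120825': 1, '20210920': 3, '20210921': 1, '20210922': 1}
```

Swapping the lambda for a function that opens each file and counts its words would give the per-day word totals mentioned in the question.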