Sort, group and process files based on an embedded timestamp in the filename [closed]

Time:09-29

How do I sort these files by a date string embedded in each filename? And then I would like to loop over all the files created on the same day.

I can do this in the shell, but it is very slow. I'd like to do the same in Python.

Sample file list (there are 2200 files total)

  1. Tyler Cowen On Reading 202109200657.md
  2. On Poems 202109210659.md
  3. Slava Akhmechet On Reading In Clusters 202109200659.md
  4. Ideation In A 4X4 Matrix 202109200717.md
  5. Drawing Grid Ideation 202109220830.md
  6. Dictation 201208251425.md

Output would look like this (for eventual graphing with Plotly.)

20120825,1  
20210920,3  
20210921,1  
20210922,1  

I want to sort by doc count on a given day, then within doc count by date. So results 1, 3 and 4 above would be listed in date order:

20210920,3
20120825,1  
20210921,1  
20210922,1  

Then I would like to do other stuff with each day's documents like get total word count for the day.

CodePudding user response:

Is this what you need? The code below extracts the date from each filename and appends the filename to a dictionary keyed by date, so your dictionary will have this format:

{
date1: [list of files],
date2: [list of files]
}

Here is the code:

from collections import defaultdict
import re
files = ['Tyler Cowen On Reading 202109200657.md',
'On Poems 202109210659.md',
'Slava Akhmechet On Reading In Clusters 202109200659.md',
'Ideation In A 4X4 Matrix 202109200717.md',
'Drawing Grid Ideation 202109220830.md',
'Dictation 201208251425.md']

out = defaultdict(list)
for file in files:
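    # Capture the run of digits just before the .md extension
    date = re.search(r'.*\s(\d+)\.md', file)
    if date:
        # Keep only YYYYMMDD, dropping the HHMM part
        date = date.group(1)[:8]
        out[date].append(file)
print(out)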

Output:

defaultdict(<class 'list'>, {'20210920': ['Tyler Cowen On Reading 202109200657.md', 'Slava Akhmechet On Reading In Clusters 202109200659.md', 'Ideation In A 4X4 Matrix 202109200717.md'], '20210921': ['On Poems 202109210659.md'], '20210922': ['Drawing Grid Ideation 202109220830.md'], '20120825': ['Dictation 201208251425.md']})

Please note that this code only shows the logic; it does not read the list of files from a directory. You will need to build a list of the required files yourself and use it in place of the hard-coded list above.
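To build that list from a real directory, something like the following may work. It is only a minimal sketch: the helper name list_markdown_files and the folder name 'notes' are placeholders of mine, not something from the question, and it assumes every note ends in .md and sits directly inside one folder.

```python
from pathlib import Path


def list_markdown_files(folder):
    """Return the names of the .md files directly inside `folder`, sorted."""
    # Path.glob yields Path objects; .name drops the directory part.
    return sorted(p.name for p in Path(folder).glob('*.md') if p.is_file())


# Hypothetical usage: replace 'notes' with the directory holding the files.
# files = list_markdown_files('notes')
```

The resulting list can then stand in for the hard-coded files list.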

CodePudding user response:

Here's the task list as I understand it.

  1. Extract a string timestamp from a filename (string and list handling).

  2. Normalise timestamps (which include hours and minutes) into datestamps (year-month-day only) to group documents from a single day (date handling).

  3. Sort by number of documents per day, descending, then within each document count by date, ascending (stable sorts).

  4. Group documents on the same date to process them in some way (passing functions to other functions).

This covers quite a bit of ground in Python programming so I'll explain as I go along.

I recommend the arrow library for manipulating dates. To begin with, install arrow:

pip install arrow

import itertools
from collections import Counter
from pathlib import Path

import arrow

docs = [
    'Tyler Cowen On Reading 202109200657.md',
    'On Poems 202109210659.md',
    'Slava Akhmechet On Reading In Clusters 202109200659.md',
    'Ideation In A 4X4 Matrix 202109200717.md',
    'Drawing Grid Ideation 202109220830.md',
    'Dictation 201208251425.md',
]


def datestamp(filename):
    basename = Path(filename).stem
    date_as_string = basename.split()[-1]
    # HH is arrow's 24-hour token (hh is the 12-hour one)
    timestamp = arrow.get(date_as_string, 'YYYYMMDDHHmm')
    return timestamp.format('YYYYMMDD')

To extract the date part from a filename, you need the last part of the document's "base name", after the last space. Python's .split() method splits a string into a list at whitespace (spaces, tabs etc):

>>> basename = 'On Poems 202109210659'
>>> basename.split()
['On', 'Poems', '202109210659']

a_list[-1] extracts the last item in a list, so:


>>> basename.split()[-1]
'202109210659'

gets everything in the file's basename after the last space.

Then arrow is used to convert the timestamp into a datestamp so that documents from the same day are grouped together.

202109210659 -> 20210921
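If you would rather not install a third-party package, the same conversion can probably be done with the standard library's datetime module. This is a sketch of an alternative, not the arrow-based approach used in the rest of this answer; the function name datestamp_stdlib is mine, and it assumes every filename ends in a 12-digit YYYYMMDDHHMM stamp.

```python
from datetime import datetime
from pathlib import Path


def datestamp_stdlib(filename):
    """Return the YYYYMMDD part of a name ending in ' YYYYMMDDHHMM.md'."""
    stamp = Path(filename).stem.split()[-1]
    # strptime raises ValueError if the stamp is not a valid date and time.
    parsed = datetime.strptime(stamp, '%Y%m%d%H%M')
    return parsed.strftime('%Y%m%d')


print(datestamp_stdlib('On Poems 202109210659.md'))  # 20210921
```

Parsing the stamp, rather than just slicing off the first eight characters, has the side benefit of rejecting filenames whose trailing digits are not a real date.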

For Plotly data:

datestamps = [datestamp(doc) for doc in docs]
datestamps.sort()
docDates = Counter()
for date in datestamps:
    docDates[date] += 1
for date, doc_count in docDates.most_common():
    print(f'{date},{doc_count}')

Counter() is a useful class from the Python standard library. Its .most_common() method is used to sort dates with the most docs first:

$ python docs.py
20210920,3
20120825,1
20210921,1
20210922,1

Note that the one-document dates are themselves sorted by date. This works because datestamps.sort() runs before .most_common(), and Python's library sorting functions are "stable": items that compare equal keep their original relative order. A Counter() iterates in insertion order, so .most_common() preserves the date order produced by datestamps.sort() wherever the document count is the same.

To understand stable sorting better, see this answer. It may take you a few goes to understand it.
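A small demonstration may help. The sketch below uses only the standard library and hard-codes the datestamps from the sample files. It sorts in two passes (secondary key first, primary key second), then checks that a single sort on a tuple key gives the same order.

```python
from collections import Counter

datestamps = ['20210920', '20210921', '20210920', '20210920',
              '20210922', '20120825']

# First pass: sort by the secondary key (date ascending).
counts = Counter(sorted(datestamps))

# Second pass: sort by the primary key (count descending).
# sorted() is stable, so dates with equal counts keep their date order.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(by_count)
# [('20210920', 3), ('20120825', 1), ('20210921', 1), ('20210922', 1)]

# The same ordering in a single pass, using a tuple key.
one_pass = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(one_pass == by_count)  # True
```

The tuple-key version states both sort levels explicitly, which some readers find easier to follow than relying on stability.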

To group documents sharing a given date, first sort the documents by the datestamp function, then group them by the same function. The sort has to come first because itertools.groupby only groups adjacent items. This lets you process all the documents associated with a single date (for daily word counts etc.). The datestamp "key" function is computed for each filename, then used to compare items while sorting and grouping.

docs.sort(key=datestamp)
for date, docs_on_date in itertools.groupby(docs, key=datestamp):
    doc_list = list(docs_on_date)
    print(f'{date}: {doc_list}')
    # for doc in doc_list:
    #     # do_something_with(doc)

Results:

20120825: ['Dictation 201208251425.md']
20210920: ['Tyler Cowen On Reading 202109200657.md', 'Slava Akhmechet On Reading In Clusters 202109200659.md', 'Ideation In A 4X4 Matrix 202109200717.md']
20210921: ['On Poems 202109210659.md']
20210922: ['Drawing Grid Ideation 202109220830.md']
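As one possible way to fill in the "do something" step, here is a sketch of the daily word count mentioned in the question. The helper names (day_of, word_count, words_per_day) are mine. To stay self-contained it takes the datestamp by slicing the first eight digits instead of calling arrow, counts whitespace-separated words, and assumes the files are UTF-8 text. The temporary files exist only so the example runs anywhere.

```python
import itertools
import tempfile
from pathlib import Path


def day_of(path):
    """Datestamp (YYYYMMDD) taken from the last word of the file's stem."""
    return Path(path).stem.split()[-1][:8]


def word_count(path):
    """Number of whitespace-separated words in a text file."""
    return len(Path(path).read_text(encoding='utf-8').split())


def words_per_day(paths):
    """Map each datestamp to the total word count of that day's files."""
    totals = {}
    # groupby only merges adjacent items, so sort by the same key first.
    for day, group in itertools.groupby(sorted(paths, key=day_of), key=day_of):
        totals[day] = sum(word_count(p) for p in group)
    return totals


# Demonstration with throwaway files, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'On Poems 202109210659.md': 'one two three four',
        'Tyler Cowen On Reading 202109200657.md': 'five six',
        'Ideation In A 4X4 Matrix 202109200717.md': 'seven',
    }
    paths = []
    for name, text in samples.items():
        path = Path(tmp) / name
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    demo_totals = words_per_day(paths)
    print(demo_totals)  # {'20210920': 3, '20210921': 4}
```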

CodePudding user response:

If you're trying to replace a shell script, your Python script will probably need to do the following.

  1. List the contents of a directory to get the filenames.
  2. Extract the date from the filenames (assuming a regular expression pattern match of \d{8} is good enough to extract the date).
  3. Sort or otherwise group the files by the extracted date.
  4. Iterate over those groups to do something.

import pathlib
import re
from collections import defaultdict

date_pattern = re.compile(r"\d{8}")
target_dir = pathlib.Path("myfolder")

# Files is a dictionary mapping a date to the list of files with that date
files = defaultdict(list)
for child in target_dir.iterdir():
    # Skip directories
    if child.is_dir():
        continue
    match = date_pattern.search(child.name)
    # Skip files that do not match the date pattern
    if match is None:
        continue
    file_date = match.group()
    files[file_date].append(child)

for date, names in files.items():
    for filename in names:
        # Do something
        print(date, filename)

Edit: sort by the date

To sort by the date, the last code block can be modified.

for date in sorted(files):
    for filename in files[date]:
        # Do something
        print(date, filename)

You could also write the loop header as for date, names in sorted(files.items(), key=lambda d: d[0]): which gives you each date together with its list of files.
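The question also asks for the days with the most documents first, with ties broken by date. Assuming a dictionary shaped like files above (date string mapped to a list of files), a sketch of that ordering might look like this. The function name by_count_then_date and the file names in the example are made up for illustration.

```python
def by_count_then_date(files_by_date):
    """Return (date, count) pairs: most files first, ties by ascending date."""
    return sorted(
        ((date, len(names)) for date, names in files_by_date.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Small stand-in for the `files` dictionary; the file names are made up.
example = {
    '20210921': ['d.md'],
    '20210920': ['a.md', 'b.md', 'c.md'],
    '20120825': ['e.md'],
    '20210922': ['f.md'],
}
for date, count in by_count_then_date(example):
    print(f'{date},{count}')
# 20210920,3
# 20120825,1
# 20210921,1
# 20210922,1
```

Negating the count in the key sorts it in descending order while the date stays ascending, so one call to sorted() covers both levels.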
