Create sublists from a main Python list by condition

I have code that generates a Python list, and I sort this list with the following lines:

mylist = list_files(MAIN_DIR)
print(sorted(mylist, key=lambda x: str(x.split('\\')[-1][:-4].replace('_',''))))

Now the list is sorted as desired. How can I split this main list into sublists based on similar PDF names? Have a look at the snapshot to see exactly what I mean.

Now I can loop through the keys and values of the grouped dictionary:

grouped_files = {}  # file name without extension -> list of full paths
for el in mylist:
    file_name = el.split('\\')[-1].split('.')[0]
    if file_name not in grouped_files:
        grouped_files[file_name] = []
    grouped_files[file_name].append(el)

for key, value in grouped_files.items():
    pdfs = value
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(pdf)

    merger.write(OUTPUT_DIR / f'{key}.pdf')
    merger.close()

But I got this error: AttributeError: 'WindowsPath' object has no attribute 'write'

CodePudding user response：

Here's a small example using groupby to create a dict of lists.

from itertools import groupby

mylist = ['..\\1\\1.pdf', '..\\2\\2.pdf', '..\\1\\1.pdf', '..\\2\\2.pdf']

# Group by file name: last path component, minus '.pdf', with underscores removed
key = lambda x: x.split('\\')[-1][:-4].replace('_', '')

results = {}
for filename, grouped in groupby(sorted(mylist, key=key), key=key):
    results[filename] = list(grouped)

results is a dictionary whose keys are the values produced by the same key function that was used to sort. Depending on what you want to take into account, you can use a different key function to derive different groupings. Two things to note when using groupby. First, it only groups consecutive items that share a key, so to get a single group per key you need to sort by that key first. Second, grouped is an iterator, not a list. This is for efficiency, but if you want a list you can call list(grouped), as shown.

>>> import pprint
>>> pprint.pprint(results)
{'1': ['..\\1\\1.pdf', '..\\1\\1.pdf'], '2': ['..\\2\\2.pdf', '..\\2\\2.pdf']}
>>>
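To illustrate why the sort matters, here is a minimal sketch with made-up paths (the `paths` list and the `name_key` helper are hypothetical, not from the question). Without sorting, the same key can come back more than once, and storing the groups in a dict would silently overwrite the earlier group.

```python
from itertools import groupby

# Hypothetical, unsorted paths: the two '1.pdf' entries are not adjacent.
paths = ['..\\a\\1.pdf', '..\\a\\2.pdf', '..\\b\\1.pdf']

def name_key(path):
    # Same idea as the key above: file name without the '.pdf' extension.
    return path.split('\\')[-1][:-4].replace('_', '')

# Without sorting, groupby starts a new group every time the key changes,
# so the key '1' appears twice.
unsorted_groups = [(k, list(g)) for k, g in groupby(paths, key=name_key)]

# Sorting by the same key first places equal keys next to each other.
ordered = sorted(paths, key=name_key)
sorted_groups = [(k, list(g)) for k, g in groupby(ordered, key=name_key)]

print([k for k, _ in unsorted_groups])  # keys in the order groupby yields them
print(sorted_groups)
```

Because `sorted` is stable, paths that share a name keep their original relative order inside each group.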

It can also be solved with a defaultdict. This doesn't require having to sort anything first.

from collections import defaultdict

mylist = ['..\\1\\1.pdf', '..\\2\\2.pdf', '..\\1\\1.pdf', '..\\2\\2.pdf']

# Group by file name: last path component, minus '.pdf', with underscores removed
key = lambda x: x.split('\\')[-1][:-4].replace('_', '')

results = defaultdict(list)

for filename in mylist:
    results[key(filename)].append(filename)

Which results in

>>> import pprint
>>> pprint.pprint(results)
defaultdict(<class 'list'>,
            {'1': ['..\\1\\1.pdf', '..\\1\\1.pdf'],
             '2': ['..\\2\\2.pdf', '..\\2\\2.pdf']})

Using defaultdict means you can look up a key that isn't in the dictionary yet, in this case the value returned by the key function, and get a default value back instead of a KeyError. Here it's set to create an empty list whenever it sees a new key, so we can append to the value every time.
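A small side-by-side sketch of that difference, using a made-up path (the variable names here are only for illustration):

```python
from collections import defaultdict

error = None
plain = {}
try:
    plain['1'].append('..\\1\\1.pdf')  # key '1' does not exist yet
except KeyError as exc:
    error = exc  # a plain dict raises KeyError on a missing key

grouped = defaultdict(list)
grouped['1'].append('..\\1\\1.pdf')  # missing key: list() is called, then appended to

# Caveat: merely reading a missing key also creates it.
_ = grouped['never-added']

print(type(error).__name__)
print(dict(grouped))
```

If creating keys on read is unwanted, `dict.setdefault(key, [])` on a plain dict is a common alternative.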

CodePudding user response：

I'm guessing you want the paths grouped by file name. If that's the case, my approach would be something like the following:

mylist = list_files(MAIN_DIR)

grouped_files = {}  # file name without extension -> list of full paths
for el in mylist:
    # Last path component, up to the first dot
    file_name = el.split('\\')[-1].split('.')[0]
    if file_name not in grouped_files:
        grouped_files[file_name] = []
    grouped_files[file_name].append(el)
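As for the AttributeError in the question: a likely cause, assuming an older PyPDF2 release in which PdfFileMerger.write expects either a file name string or an open file object, is that the pathlib path is treated as a file object. The sketch below sidesteps this by opening the output file explicitly. `merge_groups` and its `merger_factory` parameter are hypothetical names, introduced so the snippet doesn't depend on a particular PDF library.

```python
from pathlib import Path

def merge_groups(grouped_files, output_dir, merger_factory):
    """Write one merged PDF per group and return the output paths.

    merger_factory is an assumption: any callable returning an object with
    append(path), write(file_object) and close(), such as PyPDF2's
    PdfFileMerger.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, pdfs in grouped_files.items():
        merger = merger_factory()
        for pdf in pdfs:
            merger.append(pdf)
        out_path = output_dir / f'{name}.pdf'
        # Pass an open binary file handle rather than the Path itself.
        with open(out_path, 'wb') as fh:
            merger.write(fh)
        merger.close()
        written.append(out_path)
    return written
```

With PyPDF2 installed, this would be called as merge_groups(grouped_files, OUTPUT_DIR, PdfFileMerger). Passing str(OUTPUT_DIR / f'{key}.pdf') to merger.write in the original loop is another option.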