Parse filenames and make a total count of occurrences of the first two words, over 50K names

Time:08-03

I wanted to count how many filenames share the same two leading words (a tree name or other identifier), iterating over a whole folder (over 50K files), and report both a total per name and an overall total of the occurrences.

The filenames look something like this, or variations thereof:

Abies_alba_0_2545_WEFL_NLF.tif
Abies_alba_8_321565_WEFL_NLF.tif
Larix_kaempferi_3_43357_WEFL_NLF.tif

I actually managed a workaround and got the results that I wanted, but it was very slow: I had to pick out the key parts of each name by hand and repeat the same lines of the script for every one. I used "How to count number of files in a file with certain extension or name?" as a basis and produced this:

import glob
import os

# These are our counters
total_count = 0
Abies_alba_count = 0
Acer_pseudoplatanus_count = 0
Alnus_spec_count = 0
Betula_spec_count = 0
Cleared_0_count = 0
Fagus_sylvatica_count = 0
Fraxinus_excelsior_count = 0
Larix_decidua_count = 0
Larix_kaempferi_count = 0
Picea_abies_count = 0 
Pinus_nigra_count = 0
Pinus_strobus_count = 0 
Pinus_sylvestris_count = 0
Populus_spec_count = 0
Prunus_spec_count = 0
Pseudotsuga_menziesii_count = 0
Quercus_petraea_count = 0
Quercus_robur_count = 0
Quercus_rubra_count = 0
Tilia_spec_count = 0


for file in os.listdir(r'FILEPATH TO FOLDER HOLDING THE FILES'):
    if file.endswith('tif'):
        total_count += 1
        if 'Abies_alba' in file:
            Abies_alba_count += 1
        if 'Acer_pseudoplatanus' in file:
            Acer_pseudoplatanus_count += 1
        if 'Alnus_spec' in file:
            Alnus_spec_count += 1
        if 'Betula_spec' in file:
            Betula_spec_count += 1
        if 'Cleared_0' in file:
            Cleared_0_count += 1
        if 'Fagus_sylvatica' in file:
            Fagus_sylvatica_count += 1
        if 'Fraxinus_excelsior' in file:
            Fraxinus_excelsior_count += 1
        if 'Larix_decidua' in file:
            Larix_decidua_count += 1
        if 'Larix_kaempferi' in file:
            Larix_kaempferi_count += 1
        if 'Picea_abies' in file:
            Picea_abies_count += 1
        if 'Pinus_nigra' in file:
            Pinus_nigra_count += 1
        if 'Pinus_strobus' in file:
            Pinus_strobus_count += 1
        if 'Pinus_sylvestris' in file:
            Pinus_sylvestris_count += 1
        if 'Populus_spec' in file:
            Populus_spec_count += 1
        if 'Prunus_spec' in file:
            Prunus_spec_count += 1
        if 'Pseudotsuga_menziesii' in file:
            Pseudotsuga_menziesii_count += 1
        if 'Quercus_petraea' in file:
            Quercus_petraea_count += 1
        if 'Quercus_robur' in file:
            Quercus_robur_count += 1
        if 'Quercus_rubra' in file:
            Quercus_rubra_count += 1
        if 'Tilia_spec' in file:
            Tilia_spec_count += 1

print('Abies alba:', Abies_alba_count)
print('Acer pseudoplatanus:', Acer_pseudoplatanus_count)
print('Alnus spec:', Alnus_spec_count)
print('Betula_spec:', Betula_spec_count)
print('Cleared 0:', Cleared_0_count)
print('Fagus sylvatica:', Fagus_sylvatica_count)
print('Fraxinus excelsior:', Fraxinus_excelsior_count)
print('Larix decidua:', Larix_decidua_count)
print('Larix kaempferi:', Larix_kaempferi_count)
print('Picea abies:', Picea_abies_count)
print('Pinus nigra:', Pinus_nigra_count)
print('Pinus strobus:', Pinus_strobus_count)
print('Pinus sylvestris:', Pinus_sylvestris_count)
print('Populus spec:', Populus_spec_count)
print('Prunus spec:', Prunus_spec_count)
print('Pseudotsuga menziesii:', Pseudotsuga_menziesii_count)
print('Quercus petraea:', Quercus_petraea_count)
print('Quercus robur:', Quercus_robur_count)
print('Quercus rubra:', Quercus_rubra_count)
print('Tilia spec:', Tilia_spec_count)
print('Total:', total_count)

Which works and gives me the results I wanted, as below:

Abies alba: 984
Acer pseudoplatanus: 2821
Alnus spec: 2563
Betula_spec: 2821
Cleared 0: 4123
Fagus sylvatica: 6459
Fraxinus excelsior: 2634
Larix decidua: 1360
Larix kaempferi: 1748
Picea abies: 5783
Pinus nigra: 421
Pinus strobus: 500
Pinus sylvestris: 6591
Populus spec: 464
Prunus spec: 304
Pseudotsuga menziesii: 2691
Quercus petraea: 2608
Quercus robur: 3453
Quercus rubra: 1841
Tilia spec: 212
Total: 50381

So, yes, this works, but it was awful to do and I understand it to be "smelly" code, if I'm using the term correctly! Could someone advise on how to get to the end result without all the manual work that I had to do?

I also intend to plot the output in some follow-up work, showing the relative weighting of the file/tree types, but I was trying to avoid writing to a CSV and working from there, as I understand that to be bad practice. Any further tips for this?

CodePudding user response:

Use a dictionary! Updated to dynamically determine the words to search.

import os

words = set()
counts = {}
total_count = 0
file_names = os.listdir(r'FILEPATH TO FOLDER HOLDING THE FILES')
# Collect every distinct two-word prefix, e.g. "Abies_alba".
for file_name in file_names:
    fn_words = file_name.split("_")
    if len(fn_words) >= 2:
        # Join with "_" so the key matches the text in the filename.
        words.add(f"{fn_words[0]}_{fn_words[1]}")
# Count the files that start with each prefix.
for word in words:
    for file_name in file_names:
        if file_name.startswith(word + "_"):
            counts[word] = counts.get(word, 0) + 1
            total_count += 1
# You could iterate through this dict in any way if you would rather process the data in some other fashion than just printing.
for word, count in counts.items():
    print(f"{word}: {count}")
print(f"Total: {total_count}")
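The nested loop above compares every prefix against every filename. As a possible alternative, here is a minimal single-pass sketch using collections.Counter from the standard library. It is not the answerer's code: the helper names species_key and count_species are hypothetical, and it assumes the key is always the first two underscore-separated words of a name ending in .tif.

```python
import os
from collections import Counter


def species_key(file_name):
    """Return the first two underscore-separated words, or None if there are fewer."""
    parts = file_name.split("_")
    if len(parts) < 2:
        return None
    return f"{parts[0]}_{parts[1]}"


def count_species(file_names, extension=".tif"):
    """Count files per two-word prefix in a single pass over the names."""
    counts = Counter()
    for file_name in file_names:
        if not file_name.endswith(extension):
            continue  # skip anything that is not an image file
        key = species_key(file_name)
        if key is not None:
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Placeholder path from the question; replace it with the real folder.
    folder = r'FILEPATH TO FOLDER HOLDING THE FILES'
    if os.path.isdir(folder):
        counts = count_species(os.listdir(folder))
        for key, count in sorted(counts.items()):
            print(f"{key}: {count}")
        print(f"Total: {sum(counts.values())}")
```

Because a Counter is a dict subclass, the result can be passed to anything that expects a plain dictionary, and the overall total is just the sum of its values.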

CodePudding user response:

You can try this.

import os
file_names = os.listdir('YOUR_DIRECTORY')
file_counts = {}
for filename in file_names:
  if filename.endswith('tif'):
    filename_parts = filename.split('_')
    # The key is the first two words, e.g. 'Abies_alba'.
    key = filename_parts[0] + '_' + filename_parts[1]
    if key in file_counts:
      file_counts[key] += 1
    else:
      file_counts[key] = 1

print(file_counts)
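On the follow-up question about plotting: a dictionary like file_counts can be plotted directly, so no CSV round trip is needed. The sketch below is one way this might look, assuming matplotlib is installed. The function names sorted_shares and plot_counts are hypothetical, and the sample input reuses three of the totals reported in the question.

```python
def sorted_shares(counts):
    """Return (name, count, percent) tuples, largest count first."""
    total = sum(counts.values())
    rows = []
    for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        percent = 100.0 * count / total if total else 0.0
        rows.append((name, count, percent))
    return rows


def plot_counts(counts):
    """Draw a horizontal bar chart of the counts."""
    # Imported here so the rest of the sketch runs without matplotlib.
    import matplotlib.pyplot as plt

    rows = sorted_shares(counts)
    names = [name for name, _, _ in rows]
    values = [count for _, count, _ in rows]
    plt.barh(names, values)
    plt.gca().invert_yaxis()  # put the largest bar at the top
    plt.xlabel("Number of files")
    plt.tight_layout()
    plt.show()


if __name__ == "__main__":
    # Three of the totals reported in the question, used as sample input.
    sample = {"Abies_alba": 984, "Pinus_nigra": 421, "Tilia_spec": 212}
    for name, count, percent in sorted_shares(sample):
        print(f"{name}: {count} ({percent:.1f}%)")
```

Calling plot_counts(file_counts) after the counting loop would draw the chart from the in-memory dictionary. Writing a CSV is only worth it if the counts need to be kept or shared between runs.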