Classifying list of filenames into its respective types

Time:09-21

So I have a list of the filenames in a directory, and each file belongs to one of several types. Say the list looks like this:

list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']

and the types of file are stored in a .csv file like this:

,type
0,apple
1,apple_tea
2,apple_town

I want to classify each filename in the list by its file type and put the results into a dictionary. After processing, the dictionary would look like this:

dictionary = {
     'apple':['apple-20220103.csv'],
     'apple_tea':['apple_tea-20220304.csv'],
     'apple_town':['20220203-apple_town.csv', 'apple_town20220101.csv']
}

The question is: how can I ensure that apple does not receive any file besides apple-20220103.csv, even though other filenames also contain the word apple? I've tried simple regex matching, but the result still puts the apple_tea and apple_town filenames under apple.

CodePudding user response:

You could match everything that is not a digit or a dash, using the pattern given below. Then you can use the complete match as a key for your dictionary.

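import re

your_list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']

# One or more characters that are neither digits nor dashes
pattern = r'[^0-9\-]+'

for element in your_list:
    # element[:-4] drops the ".csv" extension before matching
    a = re.search(pattern, element[:-4])
    print(a.group())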

# Output
apple
apple_tea
apple_town
apple_town
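
To go from the printed matches to the dictionary the question asks for, the match can be used directly as the key. The following is a minimal sketch, assuming every filename ends in ".csv" and that the type name is the only part of the name that is not a digit or a dash. The use of defaultdict is my choice, not part of the original answer. Note that the keys come from the filenames themselves, so the types .csv file is not consulted.

```python
import re
from collections import defaultdict

your_list = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']

# One or more characters that are neither digits nor dashes
pattern = re.compile(r'[^0-9\-]+')

dictionary = defaultdict(list)
for element in your_list:
    match = pattern.search(element[:-4])  # drop the ".csv" extension
    if match:  # skip names made only of digits and dashes
        dictionary[match.group()].append(element)

print(dict(dictionary))
```

If a filename could contain a type that is not listed in the types file, checking the key against that list before appending would be a sensible extra step.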

CodePudding user response:

Please take a look at the word boundary \b

import re
filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple','apple_tea','apple_town']
for category in categories:
    print(category)
    pattern = r'\b' + re.escape(category) + r'\b'
    for filename in filenames:
        if re.search(pattern, filename):
            print(filename)
    print()

gives output

apple
apple-20220103.csv

apple_tea
apple_tea-20220304.csv

apple_town
20220203-apple_town.csv

From the re docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.(...)

I also use re.escape to make sure that if a character with special meaning appears in a category name (e.g. a dot), it is treated as a literal character.
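
Note that the output above does not list apple_town20220101.csv: digits count as word characters, so there is no \b between apple_town and 2. One possible variation, sketched here under the assumption that category names consist only of letters and underscores, replaces \b with lookarounds that reject a neighbouring letter or underscore but allow digits. The lookaround pattern and the dictionary-building step are my additions, not part of the answer above.

```python
import re

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple', 'apple_tea', 'apple_town']

dictionary = {}
for category in categories:
    # The category must not touch a letter or underscore on either side;
    # digits, dashes and dots are allowed next to it.
    pattern = r'(?<![A-Za-z_])' + re.escape(category) + r'(?![A-Za-z_])'
    dictionary[category] = [f for f in filenames if re.search(pattern, f)]

print(dictionary)
```

With the sample data this places both apple_town files under apple_town while keeping apple limited to apple-20220103.csv.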

CodePudding user response:

One approach to the problem could be to use the library difflib.

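import pandas as pd
import difflib

# Named "filenames" rather than "list" to avoid shadowing the built-in
filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
csv_file = pd.read_csv("file.csv")

thisdict = {}

# Each row holds one type name in the "type" column
for _, row in csv_file.iterrows():
    # n=len(filenames) and cutoff=0 return every filename, best match first
    close = difflib.get_close_matches(row["type"], filenames, len(filenames), 0)
    thisdict[str(row["type"])] = close[0]
print(thisdict)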

This produces the following output.

{'apple': 'apple-20220103.csv', 'apple_tea': 'apple_tea-20220304.csv', 'apple_town': 'apple_town20220101.csv'}

Notice that only the closest string gets put into the dictionary for each type, so 20220203-apple_town.csv is left out.
