Classifying a list of filenames into their respective types

I have a list of filenames from a directory, and each filename belongs to a particular type. Say the list looks like this:

list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']

The file types are stored in a .csv file like this:

,type
0,apple
1,apple_tea
2,apple_town

I want to classify each filename in the list by its type and put the results into a dictionary. After processing, the dictionary should look like this:

dictionary = {
     'apple':['apple-20220103.csv'],
     'apple_tea':['apple_tea-20220304.csv'],
     'apple_town':['20220203-apple_town.csv', 'apple_town20220101.csv']
}

How can I make sure that apple receives no file other than apple-20220103.csv, even though the other filenames also contain the word apple? I've tried simple regex matching, and the result still puts the apple_tea and apple_town filenames under apple.

CodePudding user response：

You could match everything which is not a number or a dash by the pattern given below. Then you can use the complete match as a key for your dictionary.

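import re

your_list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']

# One or more characters that are neither digits nor dashes
pattern = r'[^0-9-]+'

for element in your_list:
    # element[:-4] strips the ".csv" extension before searching
    a = re.search(pattern, element[:-4])
    print(a.group())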

# Output
apple
apple_tea
apple_town
apple_town
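To go from the printed matches to the dictionary the question asks for, the match can be used as the key directly. The following is only a sketch: the `categories` set is assumed to have been loaded from the `type` column of the .csv file, and the names `name_pattern` and `classified` are made up for illustration. Filenames whose extracted name is not a known type are skipped.

```python
import re
from collections import defaultdict

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']

# Assumption: these values were read from the 'type' column of the .csv file.
categories = {'apple', 'apple_tea', 'apple_town'}

# First run of characters that are neither digits nor dashes.
name_pattern = re.compile(r'[^0-9-]+')

classified = defaultdict(list)
for filename in filenames:
    stem = filename[:-4]  # drop the ".csv" extension
    match = name_pattern.search(stem)
    # Keep the file only if the extracted name is exactly a known type.
    if match and match.group() in categories:
        classified[match.group()].append(filename)

print(dict(classified))
```

Because the whole extracted name must equal a type, apple never picks up apple_tea or apple_town files. This relies on the filenames containing nothing but the type name, digits, and dashes before the extension.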

CodePudding user response：

Take a look at the word boundary \b:

import re
filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple','apple_tea','apple_town']
for category in categories:
    print(category)
    pattern = r'\b' + re.escape(category) + r'\b'
    for filename in filenames:
        if re.search(pattern, filename):
            print(filename)
    print()

This gives the output:

apple
apple-20220103.csv

apple_tea
apple_tea-20220304.csv

apple_town
20220203-apple_town.csv

From the re docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.(...)

I also use re.escape so that if a character with special meaning (e.g. a dot) appears in a category name, it is treated as a literal character.
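One caveat with \b: digits and the underscore count as word characters, so there is no boundary between apple_town and the digits in apple_town20220101.csv, which is why that file is missing from the output above. A possible workaround, sketched here under the assumption that type names consist only of letters and underscores, is to replace \b with lookarounds that reject a neighbouring letter or underscore. The function name `classify` is hypothetical.

```python
import re

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple', 'apple_tea', 'apple_town']


def classify(filenames, categories):
    """Map each category to the filenames that contain it as a whole name."""
    result = {}
    for category in categories:
        # The category must not be preceded or followed by a letter or an
        # underscore; digits, dashes and dots are allowed as neighbours.
        pattern = re.compile(
            r'(?<![A-Za-z_])' + re.escape(category) + r'(?![A-Za-z_])'
        )
        result[category] = [f for f in filenames if pattern.search(f)]
    return result


classified = classify(filenames, categories)
print(classified)
```

With the sample data this puts both apple_town files under apple_town while apple still gets only apple-20220103.csv.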

CodePudding user response：

One approach to the problem could be to use the library difflib.

import difflib
import pandas as pd

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
csv_file = pd.read_csv("file.csv")

thisdict = {}

for file_type in csv_file["type"]:
    # n=len(filenames) and cutoff=0 return every filename, best match first
    close = difflib.get_close_matches(file_type, filenames, len(filenames), 0)
    # Keep only the single closest filename for this type
    thisdict[str(file_type)] = close[0]
print(thisdict)

This produces the following output.

{'apple': 'apple-20220103.csv', 'apple_tea': 'apple_tea-20220304.csv', 'apple_town': 'apple_town20220101.csv'}

Note that only the closest filename is stored for each type, so 20220203-apple_town.csv does not appear in the result.
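If every file should end up in the dictionary, one option is to turn the lookup around and ask difflib for the closest type for each filename instead. This is a rough sketch rather than a robust solution: the `categories` list is assumed to come from the `type` column, and because `cutoff=0` always returns some match, a filename that belongs to none of the types would still be assigned to one.

```python
import difflib
from collections import defaultdict

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']

# Assumption: these values were read from the 'type' column of the .csv file.
categories = ['apple', 'apple_tea', 'apple_town']

classified = defaultdict(list)
for filename in filenames:
    # n=1 keeps only the best-scoring type; cutoff=0 disables the
    # similarity threshold so a match is always returned.
    best = difflib.get_close_matches(filename, categories, n=1, cutoff=0)
    classified[best[0]].append(filename)

print(dict(classified))
```

Since the scoring is fuzzy, raising `cutoff` and handling an empty result is probably safer for real data than trusting the best match unconditionally.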