I have a list of filenames from a directory, and each filename corresponds to a file type. Say the list looks like this:
list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
and the file types are stored in a .csv file like this:
,type
0,apple
1,apple_tea
2,apple_town
I want to classify each filename in the list by its file type and collect the results in a dictionary. After processing, the dictionary would look like this:
dictionary = {
    'apple': ['apple-20220103.csv'],
    'apple_tea': ['apple_tea-20220304.csv'],
    'apple_town': ['20220203-apple_town.csv', 'apple_town20220101.csv']
}
The question is: how can I ensure that apple does not receive any file besides apple-20220103.csv, even though the other filenames also contain the word apple? I've tried simple regex matching, and the result still puts the apple_tea and apple_town filenames under apple.
CodePudding user response:
You could match everything which is not a number or a dash by the pattern given below. Then you can use the complete match as a key for your dictionary.
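import re

your_list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
# One or more characters that are neither digits nor dashes
pattern = r'[^0-9\-]+'
for element in your_list:
    # element[:-4] drops the ".csv" extension before searching
    a = re.search(pattern, element[:-4])
    print(a.group())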
# Output
apple
apple_tea
apple_town
apple_town
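To go from the printed matches to the dictionary the question asks for, the match can be used as the key, as suggested above. The following is only a minimal sketch: the helper name group_by_type is made up for illustration, and it assumes that type names never contain digits or dashes.

```python
import re
from collections import defaultdict

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']

# First run of characters that are neither digits nor dashes.
# Assumption: type names themselves contain no digits or dashes.
pattern = re.compile(r'[^0-9\-]+')


def group_by_type(names):
    """Hypothetical helper: group filenames by the type name found in them."""
    grouped = defaultdict(list)
    for name in names:
        stem = name.rsplit('.', 1)[0]  # drop the extension, if any
        match = pattern.search(stem)
        if match:
            grouped[match.group()].append(name)
    return dict(grouped)


dictionary = group_by_type(filenames)
print(dictionary)
```

Because the key comes from the filename itself, this never consults the .csv of known types; a filename with an unexpected name would simply create a new key.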
CodePudding user response:
Take a look at the word boundary \b:
import re

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple', 'apple_tea', 'apple_town']
for category in categories:
    print(category)
    # Require a word boundary on both sides of the category name
    pattern = r'\b' + re.escape(category) + r'\b'
    for filename in filenames:
        if re.search(pattern, filename):
            print(filename)
    print()
which gives the output
apple
apple-20220103.csv

apple_tea
apple_tea-20220304.csv

apple_town
20220203-apple_town.csv

From the re docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. (...)
I also use re.escape so that any character with a special meaning in a category name (e.g. a dot) is treated as a literal character.
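Note that apple_town20220101.csv is missing from the output above: digits are word characters, so there is no \b between apple_town and the 2 that follows it. One possible workaround, sketched here under the assumption that only letters and underscores can belong to a type name, is to replace \b with lookarounds. The function name classify is made up for illustration.

```python
import re

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple', 'apple_tea', 'apple_town']


def classify(names, cats):
    """Hypothetical helper: map each category to the filenames containing it."""
    result = {}
    for cat in cats:
        # Assumption: a letter or underscore next to the category means it is
        # part of a longer name; digits, dashes and dots are allowed neighbours.
        pattern = re.compile(
            r'(?<![A-Za-z_])' + re.escape(cat) + r'(?![A-Za-z_])'
        )
        result[cat] = [name for name in names if pattern.search(name)]
    return result


dictionary = classify(filenames, categories)
print(dictionary)
```

With the sample data this puts both apple_town files under apple_town while still keeping them out of apple.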
CodePudding user response:
One approach to the problem could be to use the difflib library.
import difflib

import pandas as pd

# Renamed from "list" to avoid shadowing the built-in
filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
csv_file = pd.read_csv("file.csv")
thisdict = {}
for category in csv_file["type"]:
    # cutoff=0 keeps every candidate; results are sorted from most to least similar
    close = difflib.get_close_matches(category, filenames, n=len(filenames), cutoff=0)
    thisdict[category] = close[0]
print(thisdict)
This produces the following output.
{'apple': 'apple-20220103.csv', 'apple_tea': 'apple_tea-20220304.csv', 'apple_town': 'apple_town20220101.csv'}
Notice that only the closest string gets put into the dictionary.
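If every filename should end up in the dictionary, the comparison can be turned around: for each filename, pick the most similar type. This is only a sketch of that variation. The categories list stands in for the type column of the .csv, and since similarity scoring is a heuristic, it may misclassify names that look very different from the samples here.

```python
import difflib
from collections import defaultdict

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
# Stand-in for the "type" column of the .csv file
categories = ['apple', 'apple_tea', 'apple_town']

grouped = defaultdict(list)
for name in filenames:
    # n=1 asks for the single best match; cutoff=0 guarantees one is returned
    best = difflib.get_close_matches(name, categories, n=1, cutoff=0)
    grouped[best[0]].append(name)

thisdict = dict(grouped)
print(thisdict)
```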