Given a directory containing the following files:
pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k
pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k
pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k
pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k
pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k
pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k
pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k
...
I can read these in with the following comprehension:
datasets = [d for d in os.listdir('path/to/dir')]
However, what I want to do is analyse these datasets in groups, with the groups being window (e.g. blackman, hann) and nperseg (e.g. 8192, 4096, etc.).
The problem here is how to best achieve this fairly quickly given a large number of actual datasets. Would a dictionary be ideal? For example:
dict(
    blackman: dict(
        4096: [file1, file2, file3],
        8192: [..., ],
        ...
    ),
    ...
)
Thanks!
CodePudding user response:
If I understand you correctly, you can use re to parse filenames and dict.setdefault to group them:
import re

file_names = [
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k",
    "pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k",
    "pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k",
    "pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k",
]

# Group 1 captures the window name, group 2 the nperseg value.
pat = re.compile(r"window_([^_]+)_nperseg_([^_]+)")

out = {}
for name in file_names:
    m = pat.search(name)
    if m:
        # Create the nested dict/list on first use, then append the file name.
        out.setdefault(m.group(1), {}).setdefault(m.group(2), []).append(name)

print(out)
Prints:
{
    "blackman": {
        "4096": [
            "pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k",
            "pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k",
        ],
        "8192": [
            "pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k"
        ],
        "16384": [
            "pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k"
        ],
    },
    "hamming": {
        "4096": [
            "pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k"
        ],
        "8192": [
            "pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k"
        ],
    },
    "hann": {
        "4096": ["pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k"]
    },
}
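As a possible variation, here is a minimal sketch of the same grouping wrapped in a function, using collections.defaultdict instead of chained setdefault calls and converting nperseg to an int so the keys sort numerically. The names PATTERN and group_datasets are hypothetical (they are not from the question), and the sketch assumes every nperseg value consists of digits only:

```python
import re
from collections import defaultdict

# Named groups make the two grouping keys explicit.
# Assumption: nperseg is always a run of digits.
PATTERN = re.compile(r"window_(?P<window>[^_]+)_nperseg_(?P<nperseg>\d+)")


def group_datasets(names):
    """Return {window: {nperseg (int): [names]}}; names that do not match are skipped."""
    groups = defaultdict(lambda: defaultdict(list))
    for name in names:
        m = PATTERN.search(name)
        if m:
            groups[m["window"]][int(m["nperseg"])].append(name)
    # Convert to plain dicts so a missing key raises KeyError
    # instead of silently creating an empty entry.
    return {window: dict(by_nperseg) for window, by_nperseg in groups.items()}


file_names = [
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k",
    "pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k",
]

grouped = group_datasets(file_names)

# Iterate group by group, with nperseg in ascending numeric order.
for window, by_nperseg in grouped.items():
    for nperseg, files in sorted(by_nperseg.items()):
        print(window, nperseg, len(files))
```

In practice you would pass the result of os.listdir('path/to/dir') instead of the hard-coded list. The grouping is a single pass over the names, so it should scale fine to a large number of datasets.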