I have a folder with the following structure:
data
|-folder1
|--subfolder1
|---file1
|---file2
|--subfolder2
|---file1
|---file2
|-folder2
|--subfolder1
|---file1
|---file2
|--subfolder2
|---file1
|---file2
with many folders, subfolders and files.
How can I create a list that is subdivided into smaller lists containing my data?
For example, I'd end up with a list called data, and I could retrieve file1 from folder1-subfolder1 by indexing data[0][0][0].
As of now, I have created empty lists for each file, but I'm not sure how to append to a list of lists.
I have:
import os
import pandas as pd

file1 = []
file2 = []
for folder in sorted(os.listdir(path)):
    if folder != 'Documentation.txt':
        for subfolder in sorted(os.listdir(path + '/' + folder)):
            if subfolder != '.DS_Store':
                for file in sorted(os.listdir(path + '/' + folder + '/' + subfolder)):
                    if file.endswith(".x.dat"):
                        file1.append(pd.read_csv(path + '/' + folder + '/' + subfolder + '/' + file,
                                                 header=None, sep=' '))
                    if file.endswith(".y.dat"):
                        file2.append(pd.read_csv(path + '/' + folder + '/' + subfolder + '/' + file,
                                                 header=None, sep=' '))
data = [file1, file2]
This returns all the data files, but I'm struggling to figure out how to nest each file in a list of lists according to the original folder structure... I feel like the solution is pretty trivial, I'm just not great with Python. Thanks.
CodePudding user response:
It's not clear to me what exact output you want, but os.walk is probably the best option for you to generate a tree of your files:
>>> import os
>>> import re
>>> data_path = '/Users/nilton/data'
>>> files_paths = []
>>> for dirpath, dirnames, filenames in os.walk(data_path):
...     for filename in filenames:
...         if re.search(r'\.dat$', filename, re.I):
...             files_paths.append(os.path.join(dirpath, filename))
...
>>> files_paths
['/Users/nilton/data/folder2/subfolder2/file2.dat',
'/Users/nilton/data/folder2/subfolder2/file1.dat',
...]
Knowing this, and reading the os.walk documentation, you can get your desired output by working with the 3-tuple (dirpath, dirnames, filenames) that os.walk yields.
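For instance, here is a minimal sketch along those lines (assuming the data_path above, the two-level folder/subfolder layout from the question, and plain .dat files): it collects the file paths per directory and then sorts them into nested lists.

import os

data_path = '/Users/nilton/data'  # assumed root directory, as above

tree = {}  # {folder: {subfolder: [file path, ...]}}
for dirpath, dirnames, filenames in os.walk(data_path):
    parts = os.path.relpath(dirpath, data_path).split(os.sep)
    if len(parts) != 2:  # only folderX/subfolderY directories hold the files
        continue
    folder, subfolder = parts
    dat_files = sorted(f for f in filenames if f.endswith('.dat'))
    if dat_files:
        tree.setdefault(folder, {})[subfolder] = [
            os.path.join(dirpath, f) for f in dat_files
        ]

# Nested lists sorted by folder/subfolder name, so that data[0][0][0]
# is the first file of the first subfolder of the first folder.
data = [[tree[folder][sub] for sub in sorted(tree[folder])]
        for folder in sorted(tree)]

Wrapping each joined path in pd.read_csv(..., header=None, sep=' ') at that point would store dataframes instead of path strings, as in the code from the question.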
CodePudding user response:
You could try the following with pathlib's Path.rglob() and groupby from itertools (all standard library):
from pathlib import Path
from itertools import groupby
from functools import partial

# group key: the i-th component of the file's parent path,
# e.g. key(1, f) -> "folder1", key(2, f) -> "subfolder1"
def key(i, file): return file.parent.parts[i]

base = Path("data")

data = []
for _, group1 in groupby(base.rglob("*.dat"), key=partial(key, 1)):
    data.append([])
    for _, group2 in groupby(group1, key=partial(key, 2)):
        data[-1].append([file.name for file in group2])
With a test structure created by
base = Path("data")
for i in range(1, 4):
for j in range(1, 3):
path = (base / f"folder{i}") / f"subfolder{j}"
path.mkdir(parents=True, exist_ok=True)
for k in range(1, 3):
with open(path / f"file{i}-{j}-{k}.dat", "w") as file:
file.write("A,B,C\n1,2,3\n4,5,6")
this delivers the following data:
[[['file1-1-1.dat', 'file1-1-2.dat'], ['file1-2-1.dat', 'file1-2-2.dat']],
[['file2-1-1.dat', 'file2-1-2.dat'], ['file2-2-1.dat', 'file2-2-2.dat']],
[['file3-1-1.dat', 'file3-1-2.dat'], ['file3-2-1.dat', 'file3-2-2.dat']]]
Your code implies that you don't actually want to collect the filenames but rather pd.read_csv() them and store the dataframes in data. To do that, you have to replace
data[-1].append([file.name for file in group2])
with
data[-1].append([pd.read_csv(file) for file in group2])
And it might well be that you have to add more logic to the file selection: I just went with the .dat suffix.
You could do something similar with os.walk instead, as suggested in the other answer.
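Either way, once the dataframes are nested like this, the indexing asked for in the question works. A quick usage sketch, continuing from the pd.read_csv variant above (the exact file order follows the rglob/walk traversal):

# data[folder][subfolder][file]: first folder, first subfolder, first file
df = data[0][0][0]
print(df)  # e.g. the dataframe read from data/folder1/subfolder1/file1-1-1.dat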