Open files, keeping folder structure as nested list

I have a folder with the following structure:

data
|-folder1
  |--subfolder1
     |--file1
     |--file2
  |--subfolder2
     |--file1
     |--file2
|-folder2
  |--subfolder1
     |--file1
     |--file2
  |--subfolder2
     |--file1
     |--file2

with many folders, subfolders, and files.

How can I create a list that is subdivided into smaller lists that contain my data? For example, I'd end up with a list called data, and I could retrieve file1 from folder1/subfolder1 by indexing data[0][0][0]. So far I have created one empty list per file type, but I'm not sure how to append to a list of lists.

I have:

import os
import pandas as pd

# path is the top-level data directory (defined elsewhere)
file1 = []
file2 = []
for folder in sorted(os.listdir(path)):
    if folder != 'Documentation.txt':
        for subfolder in sorted(os.listdir(path + '/' + folder)):
            if subfolder != '.DS_Store':
                for file in sorted(os.listdir(path + '/' + folder + '/' + subfolder)):
                    if file.endswith(".x.dat"):
                        file1.append(pd.read_csv((path + '/' + folder + '/' + subfolder + '/' + file), header=None, sep=' '))
                    if file.endswith(".y.dat"):
                        file2.append(pd.read_csv((path + '/' + folder + '/' + subfolder + '/' + file), header=None, sep=' '))
data = [file1, file2]

This returns all the data files, but I'm struggling to figure out how to nest each file in a list of lists according to the original folder structure. I feel like the solution will be pretty trivial; I'm just not great with Python. Thanks.

CodePudding user response：

It's not clear to me exactly what output you want, but os.walk is probably the best option for generating a tree of your files:

>>> import os
>>> import re
>>> data_path = '/Users/nilton/data'
>>> files_paths = []
>>> for dirpath, dirnames, filenames in os.walk(data_path):
...     for filename in filenames:
...         if re.search(r'\.dat$', filename, re.I):
...             files_paths.append(os.path.join(dirpath, filename))
...
>>> files_paths
['/Users/nilton/data/folder2/subfolder2/file2.dat',
 '/Users/nilton/data/folder2/subfolder2/file1.dat',
 ...]

Knowing this and reading the os.walk documentation, you can manage to get your desired output by playing with the 3-tuple (dirpath, dirnames, filenames) output from os.walk.
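
As a rough illustration of that idea, here is a minimal sketch that turns the os.walk output into a folder -> subfolder -> files nesting. The function name nested_paths and its suffix parameter are made up for this example, and it assumes the data files sit exactly two levels below the root, as in the question:

```python
import os


def nested_paths(root, suffix=".dat"):
    """Return [[[file, ...] per subfolder] per folder] for files two levels below root."""
    tree = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place makes os.walk descend in sorted order.
        dirnames.sort()
        # Path of the current directory relative to root, e.g. "folder1/subfolder1".
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) != 2:
            # Skip the root itself, the folder level, and anything deeper.
            continue
        folder, subfolder = parts
        matches = sorted(
            os.path.join(dirpath, name) for name in filenames if name.endswith(suffix)
        )
        if matches:
            tree.setdefault(folder, {})[subfolder] = matches
    # Convert the nested dict into nested lists, sorted by folder and subfolder name.
    return [[tree[f][s] for s in sorted(tree[f])] for f in sorted(tree)]
```

With the layout from the question, data = nested_paths(data_path) would give a list where data[0][0][0] is the path of the first matching file in folder1/subfolder1. Replacing the os.path.join(...) inside the generator with a pd.read_csv(...) call on that path would store dataframes instead of paths.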

CodePudding user response：

You could try the following with pathlib's Path.rglob() and groupby from itertools (all standard library):

from pathlib import Path
from itertools import groupby
from functools import partial

def key(i, file): return file.parent.parts[i]

base = Path("data")
data = []
# groupby only groups consecutive items, so sort the paths first
for _, group1 in groupby(sorted(base.rglob("*.dat")), key=partial(key, 1)):
    data.append([])
    for _, group2 in groupby(group1, key=partial(key, 2)):
        data[-1].append([file.name for file in group2])

With a test structure created by

base = Path("data")
for i in range(1, 4):
    for j in range(1, 3):
        path = (base / f"folder{i}") / f"subfolder{j}"
        path.mkdir(parents=True, exist_ok=True)
        for k in range(1, 3):
            with open(path / f"file{i}-{j}-{k}.dat", "w") as file:
                file.write("A,B,C\n1,2,3\n4,5,6")

this delivers the following data:

[[['file1-1-1.dat', 'file1-1-2.dat'], ['file1-2-1.dat', 'file1-2-2.dat']],
 [['file2-1-1.dat', 'file2-1-2.dat'], ['file2-2-1.dat', 'file2-2-2.dat']],
 [['file3-1-1.dat', 'file3-1-2.dat'], ['file3-2-1.dat', 'file3-2-2.dat']]]

Your code implies that you don't actually want to collect the filenames, but rather pd.read_csv() them and store the dataframes in data. To do that (with pandas imported as pd), replace

        data[-1].append([file.name for file in group2])

with

        data[-1].append([pd.read_csv(file) for file in group2])

And it might well be that you have to add more logic to the file selection: I just went with the .dat suffix.
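
One possible shape for such extra selection logic, purely as a sketch: the helper name select_files is hypothetical, and the default suffixes are taken from the .x.dat and .y.dat checks in the question. It keeps only regular files exactly two levels below the base directory, which also leaves out entries such as .DS_Store and Documentation.txt:

```python
from pathlib import Path


def select_files(base, suffixes=(".x.dat", ".y.dat")):
    """Return the sorted paths base/<folder>/<subfolder>/<file> ending in one of suffixes."""
    base = Path(base)
    return sorted(
        p
        for p in base.glob("*/*/*")  # exactly three path components below base
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )
```

Since the result is already sorted, it could be handed to groupby in place of the rglob call, as long as base is the relative Path("data") so that the indices used in key still point at the folder and subfolder names.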

You could do something similar with os.walk instead, as suggested in the other answer.