open files, keeping folder structure as nested list

Time:03-24

I have a folder with the following structure:

data
|-folder1
  |--subfolder1
     |--file1
     |--file2
  |--subfolder2
     |file1
     |file2
|-folder2
  |--subfolder1
     |--file1
     |--file2
  |--subfolder2
     |file1
     |file2

with many folders, subfolders, and files.

How can I create a list that is subdivided into smaller lists that mirror this structure? For example, I'd end up with a list called data, and I could retrieve file1 from folder1/subfolder1 by indexing data[0][0][0]. So far I have created an empty list for each file type, but I'm not sure how to append to a list of lists.

I have:

import os
import pandas as pd

file1 = []  # dataframes read from *.x.dat files
file2 = []  # dataframes read from *.y.dat files
for folder in sorted(os.listdir(path)):
    if folder != 'Documentation.txt':
        for subfolder in sorted(os.listdir(path + '/' + folder)):
            if subfolder != '.DS_Store':
                for file in sorted(os.listdir(path + '/' + folder + '/' + subfolder)):
                    if file.endswith(".x.dat"):
                        file1.append(pd.read_csv((path + '/' + folder + '/' + subfolder + '/' + file), header=None, sep=' '))
                    if file.endswith(".y.dat"):
                        file2.append(pd.read_csv((path + '/' + folder + '/' + subfolder + '/' + file), header=None, sep=' '))
data = [file1, file2]

This returns all the data files, but I'm struggling to figure out how to nest each file in a list of lists according to the original folder structure. I feel like the solution will be pretty trivial; I'm just not great with Python. Thanks.

CodePudding user response:

It's not clear to me exactly what output you want, but os.walk is probably the best option for walking your file tree:

>>> import os
>>> import re
>>> data_path = '/Users/nilton/data'
>>> files_paths = []
>>> for dirpath, dirnames, filenames in os.walk(data_path):
...     for filename in filenames:
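...         # re.search with a trailing $ matches the suffix; re.match would only match at the start
...         if re.search(r'\.dat$', filename, re.I):
...             files_paths.append(os.path.join(dirpath, filename))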
...
>>> files_paths
['/Users/nilton/data/folder2/subfolder2/file2.dat',
 '/Users/nilton/data/folder2/subfolder2/file1.dat',
 ...]

Knowing this and reading the os.walk documentation, you can get your desired output by working with the 3-tuple (dirpath, dirnames, filenames) that os.walk yields for each directory.
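
As a rough illustration of that idea, here is a minimal sketch that turns the os.walk output into a nested list of file paths. The function name nested_paths and its suffix parameter are made up for this example, and it assumes the files sit exactly two directory levels (folder/subfolder) below the root, as in the question.

```python
import os


def nested_paths(root, suffix=".dat"):
    """Return one list per folder, each holding one list of paths per subfolder.

    Hypothetical helper: assumes files sit exactly two levels below root.
    """
    tree = {}  # folder name -> list of per-subfolder file lists
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place fixes the order os.walk descends in
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) != 2:
            continue  # skip the root, first-level folders, and deeper levels
        folder = parts[0]
        files = sorted(f for f in filenames if f.endswith(suffix))
        tree.setdefault(folder, []).append(
            [os.path.join(dirpath, f) for f in files]
        )
    # dicts keep insertion order, so folders come out in traversal order
    return list(tree.values())
```

With the layout from the question, nested_paths(path)[0][0][0] would be the first matching file in folder1/subfolder1. Wrapping each path in pd.read_csv(...) would store dataframes instead of paths.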

CodePudding user response:

You could try the following with pathlib's Path.rglob() and groupby from itertools (all standard library):

from pathlib import Path
from itertools import groupby
from functools import partial

def key(i, file):
    # i-th path component of the file's directory (with base "data": 1 = folder, 2 = subfolder)
    return file.parent.parts[i]

base = Path("data")
data = []
# groupby only groups consecutive items and rglob guarantees no order, so sort first
for _, group1 in groupby(sorted(base.rglob("*.dat")), key=partial(key, 1)):
    data.append([])
    for _, group2 in groupby(group1, key=partial(key, 2)):
        data[-1].append([file.name for file in group2])

With a test structure created by

base = Path("data")
for i in range(1, 4):
    for j in range(1, 3):
        path = (base / f"folder{i}") / f"subfolder{j}"
        path.mkdir(parents=True, exist_ok=True)
        for k in range(1, 3):
            with open(path / f"file{i}-{j}-{k}.dat", "w") as file:
                file.write("A,B,C\n1,2,3\n4,5,6")

this delivers the following data:

[[['file1-1-1.dat', 'file1-1-2.dat'], ['file1-2-1.dat', 'file1-2-2.dat']],
 [['file2-1-1.dat', 'file2-1-2.dat'], ['file2-2-1.dat', 'file2-2-2.dat']],
 [['file3-1-1.dat', 'file3-1-2.dat'], ['file3-2-1.dat', 'file3-2-2.dat']]]

Your code implies that you don't actually want to collect the filenames but rather pd.read_csv() them and store the dataframes in data. To do that (with pandas imported as pd), replace

        data[-1].append([file.name for file in group2])

with

        data[-1].append([pd.read_csv(file) for file in group2])

And it might well be that you have to add more logic to the file selection: I just went with the .dat suffix.
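
As one possible shape for that extra logic, the sketch below walks the two directory levels explicitly with Path.iterdir() and keeps only files whose names end in one of the given suffixes. The function name collect and its suffixes parameter are assumptions for illustration; the default suffixes (.x.dat and .y.dat) are taken from the question's code. Stray files such as Documentation.txt or .DS_Store are skipped because only directories are descended into and only matching file names are kept.

```python
from pathlib import Path


def collect(base, suffixes=(".x.dat", ".y.dat")):
    """Nested list of matching files: one list per folder, one per subfolder."""
    base = Path(base)
    return [
        [
            # str.endswith accepts a tuple, so either suffix qualifies
            [f for f in sorted(sub.iterdir())
             if f.is_file() and f.name.endswith(suffixes)]
            for sub in sorted(folder.iterdir()) if sub.is_dir()
        ]
        for folder in sorted(base.iterdir()) if folder.is_dir()
    ]
```

Replacing the innermost f with pd.read_csv(f, header=None, sep=' ') would load the dataframes the same way the question's code does.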

You could do something similar with os.walk instead, as suggested in the other answer.
