Extracting a differentiating numerical value from multiple files

I have multiple text files containing different text. They all contain a single appearance of the same 2 lines I am interested in:

================================================================
Result: XX/100

I am trying to write a script to collect all those XX values (numerical values between 0 and 100), and paste them in a CSV file with the text file name in column A and the numerical value in column B.

I have considered using Python or PowerShell for this purpose.

How can I identify the line where "Result" appears under the string of "===..", collect its content until '\n', and then strip "Result: " and "/100" from it?

"Result" and other numerical values could appear elsewhere in the files, but never in the quoted format and directly below "=====", like the line I'm interested in.

Thank you!

Edit: I have written this poor naive attempt to collect the numerical values.

import os

dir_path = os.path.dirname(os.path.realpath(__file__))
for filename in os.listdir(dir_path):
    if filename.endswith(".txt"):
        # Join with dir_path so the file is found from any working directory.
        with open(os.path.join(dir_path, filename), "r") as f:
            lineFound = False
            for index, line in enumerate(f):
                if lineFound:
                    line = line.replace("Result: ", "")
                    line = line.replace("/100", "")
                    # str.strip() returns a new string, so keep its result.
                    grade = line.strip()
                    lineFound = False
                    print(grade)
                    continue
                if index > 3:
                    if "================================================================" in line:
                        lineFound = True

I'd still be happy to learn if there's a simple way to do this with PowerShell, to be honest.

For the output, I used the csv writer to append the results to a file one by one.

CodePudding user response：

There are a few steps involved here. The first is to get a list of files. There are a ton of answers for that on Stack Overflow, but the one linked in the function docstring below is stupidly complete.

Once you have the list of files, you can load them one by one and use a couple of str.split() calls to get the value you want.
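
As a minimal sketch of the split idea on an in-memory string (the sample text and the helper name `extract_result` are made up for illustration and are not part of the original code):

```python
# Marker: a line of 64 '=' characters followed by the "Result: " prefix.
SEPARATOR = "=" * 64 + "\nResult: "
SUFFIX = "/100"


def extract_result(text):
    """Return the text between 'Result: ' and '/100', or None if the marker is absent."""
    parts = text.split(SEPARATOR, 1)
    if len(parts) < 2:
        return None
    return parts[1].split(SUFFIX, 1)[0].strip()


# Hypothetical file content used only for this demo.
sample = "some log output\n" + "=" * 64 + "\nResult: 87/100\nmore text\n"
print(extract_result(sample))  # prints 87
```

Returning None when the marker is missing avoids an IndexError on files that do not contain the two lines.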

Finally, write the results into a CSV file. Since the CSV file is a simple one, you don't need to use the CSV library for this.
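
Writing the lines by hand works for simple file names. If a name might contain a comma or a quote, the standard `csv` module takes care of the quoting. A hedged sketch, where `write_results` and the sample rows are hypothetical names used only for illustration:

```python
import csv
import io


def write_results(rows, outfile):
    """Write (filename, value) pairs as CSV, preceded by a header row."""
    writer = csv.writer(outfile)
    writer.writerow(["filename", "value"])
    writer.writerows(rows)


# Demo with an in-memory buffer. For a real file, the csv documentation
# recommends opening it with newline="", e.g. open("output.csv", "w", newline="").
buffer = io.StringIO()
write_results([("a,b.txt", "87"), ("c.txt", "100")], buffer)
print(buffer.getvalue())
```

The file name containing a comma comes out quoted, so it still lands in a single column.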

See the code example below. Note that I copied the function for generating the list of files from my personal GitHub repo. I reuse that one a lot.

import os


def get_files_from_path(path: str = ".", ext: "str | list | None" = None) -> list:
    """Find files in path and return them as a list.
    Gets all files in folders and subfolders
    See the answer on the link below for a ridiculously
    complete answer for this.
    https://stackoverflow.com/a/41447012/9267296
    Args:
        path (str, optional): Which path to start on.
                              Defaults to '.'.
        ext (str/list, optional): Optional file extension.
                                  Defaults to None.
    Returns:
        list: list of file paths
    """
    # Normalize ext to a tuple of lowercase suffixes (None means "any file").
    if isinstance(ext, str):
        ext = [ext]
    suffixes = None if ext is None else tuple(item.lower() for item in ext)
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            # str.endswith accepts a tuple, so each file is added at most once.
            if suffixes is None or fname.lower().endswith(suffixes):
                result.append(os.path.join(subdir, fname))
    return result


filelist = get_files_from_path("path/to/files/", ext=".txt")
split1 = "================================================================\nResult: "
split2 = "/100"


with open("output.csv", "w") as outfile:
    outfile.write('filename,value\n')
    for filename in filelist:
        with open(filename) as infile:
            parts = infile.read().split(split1)
        if len(parts) < 2:
            # Marker not found in this file; skip it instead of raising IndexError.
            continue
        value = parts[1].split(split2)[0]
        print(value)
        outfile.write(f'"{filename}",{value}\n')
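
If you want a stricter check than the two splits, a regular expression can require the "Result" line to sit directly below the "=" line. This is a different technique from the code above, sketched under the assumptions that the separator is at least 64 "=" characters and that the value is a whole number; `PATTERN` and `find_result` are illustrative names:

```python
import re

# Assumption: a line of 64 or more '=' is immediately followed by "Result: NN/100".
PATTERN = re.compile(
    r"^={64,}[ \t]*\r?\nResult:[ \t]*(\d{1,3})/100[ \t]*\r?$",
    re.MULTILINE,
)


def find_result(text):
    """Return the value as an int, or None if no valid result line is found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    value = int(match.group(1))
    # Reject values outside the 0..100 range described in the question.
    return value if 0 <= value <= 100 else None


# Hypothetical content: a stray "Result" line first, then the real one.
text = "Result: 5/100\n" + "=" * 64 + "\nResult: 42/100\nfoo 3/100\n"
print(find_result(text))  # prints 42
```

The stray "Result: 5/100" at the top is ignored because it is not preceded by the separator line.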

CodePudding user response：

You could try this.

In this example the filename written to the CSV will be its full (absolute) path. You may just want the base filename.

This uses the same, albeit seemingly unnecessary, mechanism as your attempt for deriving the source directory. It would be unusual to have your Python script in the same directory as your data.

import os
import glob

equals = '=' * 64
dir_path = os.path.dirname(os.path.realpath(__file__))
outfile = os.path.join(dir_path, 'foo.csv')
with open(outfile, 'w') as csv:
    print('A,B', file=csv)
    for file in glob.glob(os.path.join(dir_path, '*.txt')):
        prev = ''  # empty string, so prev.startswith() is safe on the first line
        with open(file) as indata:
            for line in indata:
                t = line.split()
                if len(t) == 2 and t[0] == 'Result:' and prev.startswith(equals):
                    v = t[1].split('/')
                    if len(v) == 2 and v[1] == '100':
                        print(f'{file},{v[0]}', file=csv)
                        break
                prev = line
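
As noted above, you may only want the base file name in column A. A small sketch of that variation, where `csv_row` is a hypothetical helper and not part of the code above:

```python
import os


def csv_row(path, value):
    """Build one CSV line using only the base file name of path."""
    return f"{os.path.basename(path)},{value}"


# The directory names here are placeholders for illustration.
print(csv_row(os.path.join("some", "dir", "report.txt"), "87"))  # prints report.txt,87
```

In the loop above, this would amount to replacing `file` with `os.path.basename(file)` inside the f-string.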