Process text files with Python

Complete newbie here, please help. Suppose I have a text file that looks like this:

some strings
go here 

* head1 head2 head3 ... headN
  3     "a"   0.3   ... -2
  0.1   "b"   5     ... 1
  10    "c"   -4    ... 100
# and other rows with some numbers

So, there are several lines of text before the main block of interest. This block has a "header" line; note that it starts with "*", so the real column heads start from the 2nd column. After it come rows of numbers and strings, each value corresponding to a particular head[i].

I need to process this block line by line depending on the string value in the head2 column: if, for example, the value of head2 is "a", then write a new string to a file like 'param1 = 3, param2 = 0.3', i.e. take the values from head1 and head3 of the line being processed.

The problem is that this "header" line can have a different number of elements and the order of the head[i] entries can vary, so this row could be

* head3 head1 head2 ... headN

I need to make some association between column names and column values, so that for each line I can write something like if line.head2 == "a" then ... How can I do that?

CodePudding user response：

You can use a DictReader, which gives you all the power and robustness of the csv module, but you first have to skip the initial lines.

You could process the file in 3 steps:

1. ignore any line before a line starting with a *
2. extract the field names from that line after skipping its initial * character
3. process the remaining lines as a normal csv file whose field names you now know

Possible code:

import csv
import io

filename = 'input.txt'  # example name, use your own file

with open(filename) as fd:
    # skip the initial lines up to a line starting with a *
    for line in fd:
        if line.startswith('*'):
            break
    # use a DictReader to parse that line (after the initial *)
    rd = csv.DictReader(io.StringIO(line[1:]), delimiter=' ',
                        skipinitialspace=True)
    # prepare a DictReader for the rest of the file
    fieldnames = rd.fieldnames
    rd = csv.DictReader(fd, fieldnames=fieldnames, delimiter=' ',
                        skipinitialspace=True)
    for row in rd:
        if row['head2'] == 'a':
            # add your processing here...
            print(row['head1'], row['head3'])

The rationale for using the csv module is that Python comes with batteries included, and that the csv module is a very robust module able to handle fields containing the delimiter or even newlines. So best practice is to always use it for processing csv files instead of a custom parser.
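To illustrate the robustness argument, here is a minimal sketch comparing plain whitespace splitting with csv.reader on a made-up row whose quoted field contains a space. The sample row is an assumption for illustration and does not come from the question's file.

```python
import csv
import io

# Hypothetical sample row: the second field contains a space inside quotes.
sample = '3 "a b" 0.3\n'

# Plain whitespace splitting cuts the quoted field in two and keeps the quotes.
naive_fields = sample.split()

# csv.reader keeps the quoted field together and removes the quotes.
csv_fields = next(csv.reader(io.StringIO(sample), delimiter=' ',
                             skipinitialspace=True))

print(naive_fields)  # ['3', '"a', 'b"', '0.3']
print(csv_fields)    # ['3', 'a b', '0.3']
```

If your values never contain spaces, both approaches give the same columns apart from the quotes.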

CodePudding user response：

This reads your text file line by line and, once a header row has been found, appends each following row as a dictionary to a list called rows.

I've added a simple example of how the elements in the rows list can be accessed.

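headers = []
rows = []
with open('input.txt') as f:
    for line in f:
        split_line = line.split()
        if not split_line:
            continue  # skip blank lines
        if headers:
            # map each header name to the value in the same position
            rows.append(dict(zip(headers, split_line)))
        elif split_line[0] == '*':
            # header row: drop the leading '*'
            headers = split_line[1:]

for row in rows:
    # str.split() keeps the quotes, so compare against '"a"'
    if row['head2'] == '"a"':
        print('found an "a"')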

File: input.txt

some strings
go here 

* head1 head2 head3 headN
  3     "a"   0.3   -2
  0.1   "b"   5     1
  10    "c"   -4    100
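With plain str.split() the double quotes stay part of the value, which is why the comparison in this answer is against '"a"'. If you would rather compare against 'a', one possible option (a sketch, not part of the original answer) is shlex.split from the standard library, which splits on whitespace and removes shell-style quotes:

```python
import shlex

# One data row from input.txt
line = '  3     "a"   0.3   -2\n'

plain = line.split()          # quotes are kept in the value
unquoted = shlex.split(line)  # shell-style splitting removes the quotes

print(plain)     # ['3', '"a"', '0.3', '-2']
print(unquoted)  # ['3', 'a', '0.3', '-2']
```

Note that shlex.split raises ValueError on unbalanced quotes, so this assumes the quoting in the file is well-formed.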

CodePudding user response：

If the file always starts with the same 3 lines, you can use list slicing to skip them: lines[3:]. If you read the text line by line instead, wait for the '*' line by checking if line.startswith('*').

Now for the parsing part. First we parse the headers using the split function:

headers = line.split()[1:]

We split on whitespace (the default for split) and then drop the first element of the result ("*"), which gives you a list of headers.

Now we can continue by parsing each line and creating a mapping between header and value (I'm ignoring the conversion of values from str to int/float/any other type):

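data_dict = {}
split_line = line.split()
# pair each header with the value at the same position
for i in range(len(headers)):
    data_dict[headers[i]] = split_line[i]
print(data_dict)
parsed_data.append(data_dict)

Here parsed_data is the global data container, i.e. a list created once (parsed_data = []) before the loop over the lines.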
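Putting the pieces of this answer together, and adding the value conversion it leaves out, could look roughly like the sketch below. The names convert and parse_block are hypothetical helpers, the sample lines are made up with a shuffled header order, and the result list is returned rather than kept global.

```python
def convert(token):
    """Try int, then float; otherwise return the token without quotes."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token.strip('"')


def parse_block(lines):
    """Return one dict per data row found after the '*' header line."""
    headers = None
    parsed_data = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if headers is None:
            if tokens[0] == '*':
                headers = tokens[1:]  # drop the leading '*'
            continue  # still before (or at) the header line
        parsed_data.append({h: convert(t) for h, t in zip(headers, tokens)})
    return parsed_data


# Made-up sample with the header order shuffled, as described in the question
sample = [
    'some strings\n',
    'go here \n',
    '\n',
    '* head3 head1 head2\n',
    '  0.3   3     "a"\n',
    '  5     0.1   "b"\n',
]
rows = parse_block(sample)
print(rows[0])  # {'head3': 0.3, 'head1': 3, 'head2': 'a'}
```

Because each row is a dict keyed by header name, the order of the columns in the file no longer matters.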

CodePudding user response：

This answer is just a set of minor "improvements" to @SergeBallesta's answer (previously posted in this comment).

To avoid manually iterating over the file until the asterisk appears, we can use itertools.dropwhile().

dropwhile(lambda x: not x.startswith("*"), f)

It skips all lines until one that starts with "*" appears.
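As a small illustration of that behaviour, here is dropwhile applied to a plain list of made-up lines instead of a file object (the list is an assumption for the demo; a file object is iterated the same way):

```python
from itertools import dropwhile

# Made-up lines standing in for the file contents
lines = ['some strings\n', 'go here \n', '\n', '* head1 head2\n', '  3 "a"\n']

# Drop lines while they do NOT start with '*'; everything from the
# first '*' line onward is passed through unchanged.
remaining = list(dropwhile(lambda x: not x.startswith('*'), lines))

print(remaining)  # ['* head1 head2\n', '  3 "a"\n']
```

The predicate is only consulted until it first returns False, so later lines that do not start with "*" are kept.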

To avoid initializing DictReader twice just to remove the first column, we can patch DictReader.fieldnames, which is obtained from the first line read after DictReader initialization. Removing the first item from the list can be done with a simple del statement.

So here is my version:

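from csv import DictReader
from itertools import dropwhile

with open("file.txt") as f:
    # skip everything before the header line that starts with "*"
    reader = DictReader(dropwhile(lambda x: not x.startswith("*"), f),
        delimiter=' ', skipinitialspace=True)
    # the first field name is the "*" itself, so remove it
    del reader.fieldnames[0]
    for row in reader:
        if row["head2"] == "a":
            # do something with the row, e.g. print it
            print(row)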
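For the output part of the question (writing a string like 'param1 = 3, param2 = 0.3'), a possible sketch follows. It uses io.StringIO objects as stand-ins for the input and output files so that it runs on its own; the sample content and the names source and output are assumptions for illustration.

```python
import io
from csv import DictReader
from itertools import dropwhile

# In-memory stand-in for the input file; the column order is shuffled on purpose.
source = io.StringIO(
    'some strings\n'
    'go here \n'
    '\n'
    '* head3 head1 head2\n'
    '  0.3   3     "a"\n'
    '  5     0.1   "b"\n'
)
output = io.StringIO()  # stand-in for the output file

reader = DictReader(dropwhile(lambda x: not x.startswith('*'), source),
                    delimiter=' ', skipinitialspace=True)
del reader.fieldnames[0]  # drop the leading "*" pseudo-column
for row in reader:
    if row['head2'] == 'a':
        # values are looked up by column name, so their order does not matter
        output.write(f"param1 = {row['head1']}, param2 = {row['head3']}\n")

print(output.getvalue(), end='')  # param1 = 3, param2 = 0.3
```

With real files you would replace the two StringIO objects with open(...) calls, for example inside a single with statement.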
