Split line into variables


I am trying to split a text file with multiple lines into separate variables. The text is an output of volume information with names, data sizes, etc., and I want to split each dataset into a specific variable but can't seem to get it working.

As an example, I am trying to split this data set into a variable for each item:

/vol0                                abcd4     Object RAID6   228.33 GB         --  400.00 GB  Online
/vole1                               abcd1     Object RAID6    44.19 TB   45.00 TB   45.00 TB  Online
/vole2                               abcd4     Object RAID6    11.27 TB   11.00 TB   12.00 TB  Online
/vol3                                abcd4     Object RAID6     9.50 TB         --   10.00 TB  Online
/vol4                                abcd1     Object RAID6    18.39 TB         --   19.10 TB  Online

This is the code I've run, but I keep getting an error about "not enough values to unpack".

inputfile = "dataset_input.txt"
with open(inputfile, "r") as input:
    for row in input:
        vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = row.split()

I can split the file on whitespace with the code below and it works. I just can't seem to get the pieces into separate variables so I can manipulate the datasets.

for row in input: #running through each row in the file
    output_text = row.split() #split the row based on the default white-space delimiter
    print(output_text)

I'm very new to Python, so I'm not sure if this is even possible, or how complicated it is.

CodePudding user response:

Firstly, what you have done is call the split method with no arguments, which splits each row on every run of whitespace. The number of items this produces varies from row to row and does not match the number of variables you defined, which is what raises the unpacking error. One way around this is to split on runs of two or more spaces instead, so multi-word fields stay together and every row yields the same number of fields.

Secondly, in every iteration of the for loop the same variables are overwritten with new values, so you lose the previous iteration's values. You can solve this by appending the values to per-column lists.

Here is a simple solution in which you first read the entire text file, preprocess it, and store the processed content into the required variable lists:

import re

with open("dataset_input.txt", "r") as fle:
    txt = fle.readlines()

n = len(txt)

# remove trailing newlines
for i in range(n):
    txt[i] = txt[i].rstrip("\n")

# collapse runs of 2+ whitespace characters to '#', then split on it
for i in range(n):
    txt[i] = re.sub(r"\s{2,}", "#", txt[i])
    txt[i] = txt[i].split("#")

#define required variables
x1=[]
x2=[]
x3=[]
x4=[]
x5=[]
x6=[]
x7=[]

#append each field to its respective list
for i in txt:
    x1.append(i[0])
    x2.append(i[1])
    x3.append(i[2])
    x4.append(i[3])
    x5.append(i[4])
    x6.append(i[5])
    x7.append(i[6])

print(x1,x2,x3,x4,x5,x6,x7)

Also note that it is possible to shorten the code by doing the list appending in the preprocessing stage itself, depending on how long you need to keep the raw file contents around.
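
A minimal sketch of that combined version, using two of the sample rows inline instead of reading them from dataset_input.txt:

```python
import re

# two of the sample rows inline; normally these would come from dataset_input.txt
lines = [
    "/vol0                                abcd4     Object RAID6   228.33 GB         --  400.00 GB  Online",
    "/vole1                               abcd1     Object RAID6    44.19 TB   45.00 TB   45.00 TB  Online",
]

# one list per column
x1, x2, x3, x4, x5, x6, x7 = ([] for _ in range(7))

for line in lines:
    # split on runs of two or more whitespace characters,
    # so multi-word fields like 'Object RAID6' stay together
    parts = re.split(r"\s{2,}", line.strip())
    for column, value in zip((x1, x2, x3, x4, x5, x6, x7), parts):
        column.append(value)

print(x1)  # volume paths
print(x4)  # used sizes
```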

CodePudding user response:

The error "not enough values to unpack" is produced when executing this line of code: vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = row.split(). The reason is that you are unpacking 11 separate elements from each row, but, looking at the example you show, not every row contains 11 words separated by whitespace. Check this out:

with open(inputfile, "r") as input:
    for row in input:
        output = row.split()
        print("this row provides {} arguments".format(len(output)))
        print(output)

the output :

this row provides 10 arguments
['/vol0', 'abcd4', 'Object', 'RAID6', '228.33', 'GB', '--', '400.00', 'GB', 'Online']
this row provides 11 arguments
['/vole1', 'abcd1', 'Object', 'RAID6', '44.19', 'TB', '45.00', 'TB', '45.00', 'TB', 'Online']
this row provides 11 arguments
['/vole2', 'abcd4', 'Object', 'RAID6', '11.27', 'TB', '11.00', 'TB', '12.00', 'TB', 'Online']
this row provides 10 arguments
['/vol3', 'abcd4', 'Object', 'RAID6', '9.50', 'TB', '--', '10.00', 'TB', 'Online']
this row provides 10 arguments
['/vol4', 'abcd1', 'Object', 'RAID6', '18.39', 'TB', '--', '19.10', 'TB', 'Online']

You then need to do some cleaning of your data set, or maybe an if statement on the length would be helpful. Looking at only the small portion of the data you provided, the mark "--" means that there is no quota value. So you can replace the "--" mark with a pair of meaningful values (value and unit), for example 0 and any unit. This is how you might do it:

with open(inputfile, "r") as input:
    for row in input:
        output = row.replace("--", "0 0").split()
        print("this row provides {} arguments".format(len(output)))
        print(output)

and this would be the output

this row provides 11 arguments
['/vol0', 'abcd4', 'Object', 'RAID6', '228.33', 'GB', '0', '0', '400.00', 'GB', 'Online']
this row provides 11 arguments
['/vole1', 'abcd1', 'Object', 'RAID6', '44.19', 'TB', '45.00', 'TB', '45.00', 'TB', 'Online']
this row provides 11 arguments
['/vole2', 'abcd4', 'Object', 'RAID6', '11.27', 'TB', '11.00', 'TB', '12.00', 'TB', 'Online']
this row provides 11 arguments
['/vol3', 'abcd4', 'Object', 'RAID6', '9.50', 'TB', '0', '0', '10.00', 'TB', 'Online']
this row provides 11 arguments
['/vol4', 'abcd1', 'Object', 'RAID6', '18.39', 'TB', '0', '0', '19.10', 'TB', 'Online']
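
With every row normalized to 11 fields, the unpacking from the question works again. A minimal sketch on one sample row:

```python
line = "/vol0                                abcd4     Object RAID6   228.33 GB         --  400.00 GB  Online"

# replace the '--' placeholder with two fields, then unpack as in the question
vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = line.replace("--", "0 0").split()

print(vol, used, uunit, quota, qunit, status)
```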

CodePudding user response:

It looks to me like your data is a list of fixed-length records, so rather than using split() you might take slices based on your fixed-length fields. Ultimately, I would look at implementing this with Python's struct module, but this might get you started processing a fixed-length record.

Let's start with some example data you read from your file and let's define a list of fixed length field specifications.

data = [
    "/vol0                                abcd4     Object RAID6   228.33 GB         --  400.00 GB  Online",
    "/vole1                               abcd1     Object RAID6    44.19 TB   45.00 TB   45.00 TB  Online",
    "/vole2                               abcd4     Object RAID6    11.27 TB   11.00 TB   12.00 TB  Online",
    "/vol3                                abcd4     Object RAID6     9.50 TB         --   10.00 TB  Online",
    "/vol4                                abcd1     Object RAID6    18.39 TB         --   19.10 TB  Online"
]

##------------------------------
## Only you know for sure what the start and stop is of the fields in this fixed length record.
##------------------------------
fields = [
    {"name": "path", "starts_at": 0, "width": 37},
    {"name": "abc", "starts_at": 37, "width": 5},
    {"name": "type", "starts_at": 47, "width": 13},
    {"name": "size", "starts_at": 60, "width": 11},
    # ....
]
##------------------------------

Now, given your rows of data and the field definitions we can create a list of lists.

##------------------------------
## reshape as a list of lists
##------------------------------
data2 = [
    [
        row[field["starts_at"] : field["starts_at"] + field["width"]].strip()
        for field
        in fields
    ]
    for row
    in data
]
from pprint import pprint
pprint(data2)
##------------------------------

This should give you:

[['/vol0', 'abcd4', 'Object RAID6', '228.33 GB'],
 ['/vole1', 'abcd1', 'Object RAID6', '44.19 TB'],
 ['/vole2', 'abcd4', 'Object RAID6', '11.27 TB'],
 ['/vol3', 'abcd4', 'Object RAID6', '9.50 TB'],
 ['/vol4', 'abcd1', 'Object RAID6', '18.39 TB']]

I myself would rather work with a list of dict if possible, so given the data and field definitions above, I might use them like this...

##------------------------------
## reshape as a list of dict
##------------------------------
data2 = [
    {
        field["name"]: row[field["starts_at"] : field["starts_at"] + field["width"]].strip()
        for field
        in fields
    }
    for row
    in data
]

import json # only for printing a nice output
print(json.dumps(data2, indent=2))
##------------------------------

Giving you:

[
  {
    "path": "/vol0",
    "abc": "abcd4",
    "type": "Object RAID6",
    "size": "228.33 GB"
  },
  {
    "path": "/vole1",
    "abc": "abcd1",
    "type": "Object RAID6",
    "size": "44.19 TB"
  },
  {
    "path": "/vole2",
    "abc": "abcd4",
    "type": "Object RAID6",
    "size": "11.27 TB"
  },
  {
    "path": "/vol3",
    "abc": "abcd4",
    "type": "Object RAID6",
    "size": "9.50 TB"
  },
  {
    "path": "/vol4",
    "abc": "abcd1",
    "type": "Object RAID6",
    "size": "18.39 TB"
  }
]
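
The struct module mentioned at the top could look roughly like this. The format string below just encodes the hypothetical field widths from the fields list above (37, 5, a 5-byte gap, 13, 11), and the record is built with those same widths so the sketch is self-contained; adjust both to your real layout:

```python
import struct

# build one record with the assumed widths so the example is self-contained
line = (
    "/vol0".ljust(37)
    + "abcd4".ljust(5)
    + " " * 5
    + "Object RAID6".ljust(13)
    + "228.33 GB".rjust(11)
)

# 37s = a 37-byte string field, 5x = 5 pad bytes to skip, and so on
fmt = "37s5s5x13s11s"
assert struct.calcsize(fmt) == len(line)

path, abc, kind, size = (
    field.decode().strip()
    for field in struct.unpack(fmt, line.encode())
)
print(path, abc, kind, size)
```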

CodePudding user response:

If you wanted to keep your original approach, something like this will cater for the error of sometimes having only 10 'columns' instead of the expected 11:

with open('dataset_input.txt') as f:
    lines = f.readlines()

for line in lines:
    line = line.strip().split()  # Remove white space and split by space, returns a list
    if line[6] == '--':
        # This means there is no quota value present
        # so insert another -- to correct the length ('columns') of the line to 11
        line.insert(6, '--')
    vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = line
    # Perform any calculations and prints you want here 
    # PER LINE (each iteration will overwrite the variables above)
    # Note that all variables will be strings. So convert if required.
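
Since everything arrives as strings, a small helper could normalize the size columns to floats. The to_gb function below is not part of the answer above, just an illustration, and it assumes binary units (1 TB = 1024 GB); use 1000 instead if your tool reports decimal units:

```python
def to_gb(value, unit):
    """Convert a size string like '44.19' plus a 'GB'/'TB' unit to a float in GB.

    The '--' placeholder (no quota set) is treated as 0.
    """
    if value == "--":
        return 0.0
    factor = {"GB": 1, "TB": 1024}[unit]
    return float(value) * factor

print(to_gb("228.33", "GB"))
print(to_gb("44.19", "TB"))
print(to_gb("--", "--"))
```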

You can of course change the "--" to anything you want. e.g:

...
line.insert(6, '0')
...

and also change the "--" in the qunit as well if you wish:

...
line[6] = '0'
line.insert(6, '0')
...

On an unrelated side note, you have input as your file handle in your original code. input is a Python built-in function (not a reserved keyword, so the assignment is allowed, which makes the mistake easy to miss); shadowing built-in names like this should be avoided when you choose any kind of identifier in your code.
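
A quick illustration of why shadowing the built-in bites later:

```python
# assigning to the name 'input' is legal, but it hides the built-in function
input = "dataset_input.txt"

try:
    answer = input("continue? ")  # the string is not callable
except TypeError as err:
    print(err)  # 'str' object is not callable
```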
