Extracting info from a string in Python and returning a list

Time:10-29

I am studying a large dataset, and I'm shifting my analysis from Matlab/Octave to Python. The files are organized in directories, with each directory name containing basic information about the data. In Matlab I extract that info from the folder name, and I want to do the same in Python. I know a bit about Python, but I'm definitely not a kung-fu master.

MWE

from re import split

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
s_f = lambda x: float(x) if isinstance(x,(int, float))\
    else [split('[^0-9.]+',x,1)[0], split('0[.0-9]*',x,1)[1]]
array = [[i, [float(i.split('L',1)[0]),
    s_f(i)]]for i in file_list]

The code above does not work for all the elements of file_list, and the return value of the lambda is not appended to the array. I'm also mixing the standard split() with the re.split() version, because I don't think the standard version accepts regular expressions.
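To illustrate that last point, here is a minimal sketch of the difference: str.split() treats its separator as literal text, while re.split() treats it as a pattern. The sample string is one of the names from file_list; the variable names are only for illustration.

```python
import re

name = '16L-0.4_redo'

# str.split treats the separator as literal text, not as a pattern.
# The string contains no literal '[-_]', so nothing is split.
literal = name.split('[-_]')

# re.split interprets the same separator as a regular expression,
# so it splits on every '-' or '_'.
pattern = re.split('[-_]', name)

print(literal)
print(pattern)
```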

I need an array (a list of lists) where, for each element of file_list, the first item is the folder name, i.e. the element of file_list itself. The second item of that sub-list is an array holding the number before the L, followed by 1 or 2 other elements: the number between 0 and 1 after the L, which is not always present, and whatever comes after this number (or after the L itself if the number is not present).

For the 1st element [15, 0.3]

For the 2nd [16, 0.4, '_redo']

For the 3rd [15, 0]

For the 4th [16, '-redo']

I have several stages of logic, as my actual strings are longer and have multiple parameters. I could split this into cases based on the absence or presence of the number between 0 and 1 and/or the suffix, but I wanted to see if there is a way to make this general.

I wrote the lambda function for that. If the output is a single number or string, it works. Problems arise when I need to output 2 elements for the outer list.

Apologies if my terminology is not correct. I might have mixed up lists with arrays.

Any comment, correction, or suggestion is welcome

CodePudding user response:

I think the better solution here is to look directly for the patterns that you want to match instead of splitting.

I think this will solve your problem, and is much easier to debug:

import re

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
new_file_list = []
for file in file_list:
    split_file = re.findall(r'(?:\d+(?:\.\d+)*|[-_]redo)', file)
    new_file_list.append(split_file)

print(new_file_list)

Output:

[['15', '0.3'], ['16', '0.4', '_redo'], ['15', '0'], ['16', '-redo']]
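Note that the matches above are all strings. If numeric values are wanted, as in the question, one possible follow-up is a small conversion step. This is only a sketch: the helper name to_number is hypothetical (it is not part of the answer), and it assumes every numeric token is a plain integer or decimal.

```python
import re

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']

def to_number(token):
    """Return an int or float for a numeric token, otherwise the token unchanged.

    Hypothetical helper, added for illustration only.
    """
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token  # non-numeric pieces such as '_redo' or '-redo'

converted = [
    [to_number(t) for t in re.findall(r'(?:\d+(?:\.\d+)*|[-_]redo)', name)]
    for name in file_list
]
print(converted)
```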

CodePudding user response:

You can also use the following solution with a regex pattern that captures all three parts with last two optional:

import re,ast

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
rx = re.compile(r'^(\d+)L(?:-(\d+(?:\.\d+)?))?([_-].*)?$')
array = []
for i in file_list:
    m = rx.search(i)
    if m:
        arr = list(m.groups())
        arr[0] = int(arr[0]) # This is an int
        if arr[1]: # If Group 2 matched, it is either a float or an int
            arr[1] = ast.literal_eval(arr[1]) # Parse the second number as int or float
        array.append([x for x in arr if x is not None]) # Remove any None values

print(array)
# => [[15, 0.3], [16, 0.4, '_redo'], [15, 0], [16, '-redo']]

Regex details:

  • ^ - start of string
  • (\d+) - Group 1: one or more digits (so, we can use int(arr[0]) safely)
  • L - an L letter
  • (?:-(\d+(?:\.\d+)?))? - an optional sequence of
    • - - a hyphen
    • (\d+(?:\.\d+)?) - Group 2: one or more digits and then an optional sequence of a . and one or more digits
  • ([_-].*)? - an optional Group 3: _ or - and then any chars (other than line break chars without re.DOTALL flag) up to the end of the string.
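The same breakdown can also be written into the pattern itself using re.VERBOSE and named groups. This is only a sketch of an alternative spelling of the pattern above; the group names folder_number, fraction and suffix are made up for illustration.

```python
import re

rx = re.compile(r'''
    ^(?P<folder_number>\d+)              # one or more digits
    L                                    # a literal L
    (?:-(?P<fraction>\d+(?:\.\d+)?))?    # optional: a hyphen, then an int or decimal
    (?P<suffix>[_-].*)?                  # optional: _ or - and the rest of the string
    $
''', re.VERBOSE)

m = rx.search('16L-0.4_redo')
parts = m.groupdict()  # groups that did not take part in the match map to None
print(parts)
```

With named groups, the optional pieces can be looked up by name instead of by position, which may help when the real folder names carry more parameters.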

CodePudding user response:

You can use re.split with a lookahead regex, and a list comprehension with a helper function:

import re

# Raw string so that \d and \D are not treated as string escapes
regex = re.compile(r'[-_](?=\d)|(?=[-_]\D)')

def toint(n):
    n2 = n.rstrip('L')
    if n2.replace('.', '', 1).isnumeric():
        return float(n2) if '.' in n2 else int(n2)
    else:
        return n

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
print([[toint(i) for i in l] for l in map(regex.split, file_list)])

output:

[[15, 0.3], [16, 0.4, '_redo'], [15, 0], [16, '-redo']]
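
The question also asked for each folder name to be paired with its parsed values. A minimal sketch of that last step, built on the split-based approach above, might look like this. The names splitter and parse_name are hypothetical, and splitting on the empty lookahead match assumes Python 3.7 or later.

```python
import re

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
splitter = re.compile(r'[-_](?=\d)|(?=[-_]\D)')

def parse_name(name):
    """Split a folder name and convert its numeric pieces (hypothetical helper)."""
    values = []
    for piece in splitter.split(name):
        stripped = piece.rstrip('L')  # '15L' -> '15'
        if stripped.isdigit():
            values.append(int(stripped))
        else:
            try:
                values.append(float(stripped))
            except ValueError:
                values.append(piece)  # keep suffixes such as '_redo' unchanged
    return values

# Pair each folder name with its parsed values, as requested in the question
array = [[name, parse_name(name)] for name in file_list]
for row in array:
    print(row)
```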