I am studying a large dataset, and I'm shifting my analysis from Matlab/Octave to Python. The files are organized by folders/directories, with each directory name containing basic information about the data. In Matlab I extract that info from the folder name. I want to do the same in Python. I know a bit about Python, but I'm definitely not a kung-fu master.
MWE
from re import split
file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
s_f = lambda x: float(x) if isinstance(x,(int, float))\
else [split('[^0-9.] ',x,1)[0], split('0[.0-9]*',x,1)[1]]
array = [[i, [float(i.split('L',1)[0]),
s_f(i)]]for i in file_list]
The previous code does not work for all the elements of file_list, and the return value of the lambda is not appended to the array. I'm also mixing the standard split() with the re.split() version, but I don't think the standard version accepts regular expressions.
I need an array, a list of lists, where for each element of file_list I get the folder name, i.e. the element in file_list. The second element of that sublist is an array with the number before the L, followed by one or two other elements: the number between 0 and 1 after the L, which is not always present, and whatever comes after this number (or after the L itself if the number is not present).
For the 1st element
[15, 0.3]
For the 2nd
[16, 0.4, '_redo']
For the 3rd
[15, 0]
For the 4th
[16, '-redo']
I have several stages of logic, as my actual strings are longer and have multiple parameters. I could split this into cases for the absence or presence of the number between 0 and 1 and/or the suffix, but I wanted to see if there is a way to make this general.
I wrote the lambda function for that. If the output is a single number or string, it works. Problems arise when I need to output 2 elements for the outer list.
Apologies if the way I name things is not correct. I might have mixed up lists with arrays.
Any comment, correction, or suggestion is welcome.
CodePudding user response:
I think the better solution here is to look directly for the patterns that you want to match, instead of splitting.
This should solve your problem, and it is much easier to debug:
import re
file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
new_file_list = []
for file in file_list:
    # Match either a number (digits with an optional decimal part) or a "redo" suffix
    split_file = re.findall(r'(?:\d+(?:\.\d+)*|[-_]redo)', file)
    new_file_list.append(split_file)
print(new_file_list)
Output:
[['15', '0.3'], ['16', '0.4', '_redo'], ['15', '0'], ['16', '-redo']]
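Note that the matches above are all strings, while the question asks for numbers. Here is a minimal sketch of one way to convert them. The helper name to_number is made up for this example and is not part of the answer above, and it assumes that any token that cannot be parsed as a number should stay a string:

```python
import re

file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
pattern = re.compile(r'\d+(?:\.\d+)*|[-_]redo')

def to_number(token):
    """Hypothetical helper: return an int or float when possible, else the token unchanged."""
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token  # not a number, e.g. '_redo'

converted = [[to_number(t) for t in pattern.findall(name)] for name in file_list]
print(converted)
# => [[15, 0.3], [16, 0.4, '_redo'], [15, 0], [16, '-redo']]
```

Trying int() first keeps '15' and '0' as integers, which matches the output requested in the question.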
CodePudding user response:
You can also use the following solution with a regex pattern that captures all three parts, with the last two optional:
import re, ast
file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
rx = re.compile(r'^(\d+)L(?:-(\d+(?:\.\d+)?))?([_-].*)?$')
array = []
for i in file_list:
    m = rx.search(i)
    if m:
        arr = list(m.groups())
        arr[0] = int(arr[0])  # This is an int
        if arr[1]:  # If Group 2 matched, it is either a float or an int
            arr[1] = ast.literal_eval(arr[1])  # Parse the second number as int or float
        array.append([x for x in arr if x is not None])  # Remove any None values
print(array)
# => [[15, 0.3], [16, 0.4, '_redo'], [15, 0], [16, '-redo']]
Details:
^ - start of string
(\d+) - Group 1: one or more digits (so, we can use int(arr[0]) safely)
L - an L letter
(?:-(\d+(?:\.\d+)?))? - an optional sequence of
    - - a hyphen
    (\d+(?:\.\d+)?) - Group 2: one or more digits and then an optional sequence of a . and one or more digits
([_-].*)? - an optional Group 3: _ or - and then any chars (other than line break chars without the re.DOTALL flag) up to the end of the string
$ - end of string.
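The breakdown above can also be written into the pattern itself. The following is only a sketch, assuming the same regex rewritten with re.VERBOSE and named groups. The group names (layers, value, suffix) and the function parse_folder are invented for illustration, and the function returns the [foldername, [values]] pair described in the question:

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# The group names are hypothetical and only serve as documentation.
rx = re.compile(r"""
    ^(?P<layers>\d+)L               # Group 1: digits before the L
    (?:-(?P<value>\d+(?:\.\d+)?))?  # optional hyphen followed by Group 2, a number
    (?P<suffix>[_-].*)?$            # optional Group 3: _ or - and the rest of the string
""", re.VERBOSE)

def parse_folder(name):
    """Hypothetical helper: return [name, [parsed parts]], or None if the name does not match."""
    m = rx.match(name)
    if m is None:
        return None
    parts = [int(m['layers'])]
    if m['value'] is not None:
        v = m['value']
        parts.append(float(v) if '.' in v else int(v))
    if m['suffix'] is not None:
        parts.append(m['suffix'])
    return [name, parts]

print(parse_folder('16L-0.4_redo'))
# => ['16L-0.4_redo', [16, 0.4, '_redo']]
```

With longer folder names, adding one named group per parameter may be easier to maintain than counting group positions.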
CodePudding user response:
You can use re.split with a lookahead regex, and a list comprehension with a helper function:
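import re
file_list = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']
# Split on a - or _ that precedes a digit (dropping it),
# or at the empty position before a - or _ that precedes a non-digit (keeping it)
regex = re.compile(r'[-_](?=\d)|(?=[-_]\D)')
def toint(n):
    n2 = n.rstrip('L')  # drop the trailing L from the first part
    if n2.replace('.', '', 1).isnumeric():
        return float(n2) if '.' in n2 else int(n2)
    else:
        return n
print([[toint(i) for i in l] for l in map(regex.split, file_list)])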
output:
[[15, 0.3], [16, 0.4, '_redo'], [15, 0], [16, '-redo']]
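To see what the lookahead split does on its own, here is a small sketch that prints the raw pieces before any conversion. Splitting on the zero-width alternative relies on re.split supporting empty matches, which needs Python 3.7 or later. The variable name pieces is only for this example:

```python
import re

regex = re.compile(r'[-_](?=\d)|(?=[-_]\D)')
names = ['15L-0.3', '16L-0.4_redo', '15L-0', '16L-redo']

# A separator before a digit is consumed; a separator before a non-digit
# is kept, because only the empty position in front of it is matched.
pieces = [regex.split(name) for name in names]
print(pieces)
# => [['15L', '0.3'], ['16L', '0.4', '_redo'], ['15L', '0'], ['16L', '-redo']]
```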