How to parse out two fields out of string using regex in Python?

Time:08-14

I'm trying to figure out how to use regex to parse out fields from a naming scheme. Basically, I want a way to go through a query string and pull out fields based on the naming scheme. In this case there are two fields to pull out: the ID and the DIRECTION.

DIRECTION will always be either 1 or 2

ID can be any string that is allowed for file systems (e.g., alphanumeric - _ .)

Here is the basic framework I'm trying to code up:

def function(query:str, naming_scheme:str):
    # stuff
    return (ID, DIRECTION)

Here is a query for naming scheme 1 (naming_scheme_1):

naming_scheme_1 = "[ID]_R[DIRECTION].fastq.gz"
ID, DIRECTION = function("Kuwait_110_S59_R1.fastq.gz", naming_scheme_1)
#ID = "Kuwait_110_S59"
#DIRECTION = "1"

ID, DIRECTION = function("Kuwait_110_S59_R2.fastq.gz", naming_scheme_1)
#ID = "Kuwait_110_S59"
#DIRECTION = "2"

Here is a query for naming scheme 2 (naming_scheme_2):

naming_scheme_2 = "[ID]_R[DIRECTION]_001.fastq.gz"
ID, DIRECTION = function("Kuwait_110_S59_R1_001.fastq.gz", naming_scheme_2)
#ID = "Kuwait_110_S59"
#DIRECTION = "1"

ID, DIRECTION = function("Kuwait_110_S59_R2_001.fastq.gz", naming_scheme_2)
#ID = "Kuwait_110_S59"
#DIRECTION = "2"

Here is a query for naming scheme 3 (naming_scheme_3):

naming_scheme_3 = "barcode-Kuwait_110_S59_1.fq"

ID, DIRECTION = function("barcode-Kuwait_110_S59_1.fq", naming_scheme_3)
#ID = "Kuwait_110_S59"
#DIRECTION = "1"

ID, DIRECTION = function("barcode-Kuwait_110_S59_2.fq", naming_scheme_3)
#ID = "Kuwait_110_S59"
#DIRECTION = "2"

How can I use regex (or similar) in Python to parse out fields in this context?

My current method is a series of splits on the string, which doesn't seem like the best option.

CodePudding user response:

If your 3rd naming scheme actually is

naming_scheme_3 = "barcode-[ID]_[DIRECTION].fq"

Then the Python code

import re

def get_id_and_direction(query: str):
    matcher = re.match(r"^(?:barcode-)?(?P<ID>[a-zA-Z0-9._-]+)_R?(?P<DIRECTION>[12])(?:\.fq|(?:_001)?\.fastq\.gz)$", query)
    if matcher:
        return (matcher.group('ID'), matcher.group('DIRECTION'))
    else:
        return ( None, None )

print(get_id_and_direction('Kuwait_110_S59_R1.fastq.gz'))
print(get_id_and_direction('Kuwait_110_S59_R2.fastq.gz'))
print(get_id_and_direction('Kuwait_110_S59_R1_001.fastq.gz'))
print(get_id_and_direction('Kuwait_110_S59_R2_001.fastq.gz'))
print(get_id_and_direction('barcode-Kuwait_110_S59_1.fq'))
print(get_id_and_direction('barcode-Kuwait_110_S59_2.fq'))

will give you ID and DIRECTION for all 3 naming schemes at once:

('Kuwait_110_S59', '1')
('Kuwait_110_S59', '2')
('Kuwait_110_S59', '1')
('Kuwait_110_S59', '2')
('Kuwait_110_S59', '1')
('Kuwait_110_S59', '2')

The regex "^(?:barcode-)?(?P<ID>[a-zA-Z0-9._-]+)_R?(?P<DIRECTION>[12])(?:\.fq|(?:_001)?\.fastq\.gz)$" works as follows:

^(?:barcode-)? looks for an optional 'barcode-' at the beginning - the ? at the end makes the whole group optional.

(?P<ID>[a-zA-Z0-9._-]+) is the (named) group that picks up the ID, made up of one or more alphanumeric or '.', '_', '-' characters.

_R? matches _R or just _ (the ? behind R makes the R optional), always following the ID.

(?P<DIRECTION>[12]) is the (named) group that picks up a 1 or a 2 - the direction.

(?:\.fq|(?:_001)?\.fastq\.gz)$ makes sure the string ends with either '.fq', '_001.fastq.gz' or '.fastq.gz', the 3 possible endings in your 3 naming schemes.
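
If the one-line pattern is hard to read, the same regex can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch of the pattern described above; the name FILENAME_RE is made up for illustration.

```python
import re

# Same pattern as in the answer, spread over several lines.
# FILENAME_RE is a hypothetical name, not part of the original code.
FILENAME_RE = re.compile(
    r"""
    ^(?:barcode-)?                   # optional 'barcode-' prefix
    (?P<ID>[a-zA-Z0-9._-]+)          # ID: one or more filename-safe characters
    _R?                              # '_' or '_R' between ID and DIRECTION
    (?P<DIRECTION>[12])              # DIRECTION: 1 or 2
    (?:\.fq|(?:_001)?\.fastq\.gz)$   # one of the three known endings
    """,
    re.VERBOSE,
)

match = FILENAME_RE.match("Kuwait_110_S59_R2_001.fastq.gz")
print(match.group("ID"), match.group("DIRECTION"))
```

In verbose mode, whitespace and anything after a # are ignored outside character classes, so the comments do not change what the pattern matches.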

See the code in action here: https://onlinegdb.com/yD8WBaPNt

Hope that gets you going!

CodePudding user response:

The reason the commenters asked for parameters is that you didn't give any rules.

For example, does the ID always contain 'string_3 characters_3 characters'?

Is direction always a single character? Is it even more refined? Is it always a single digit?

I've provided an answer, but without sufficient parameters, this may not help you very much. If the assumptions I outlined in the comments in the code are true, this will work out just fine. That being said, if it doesn't work, give us some rules your strings must follow.

import re

str1 = "Kuwait_110_S59_R1.fastq.gz"
str2 = "Kuwait_110_S59_R1_001.fastq.gz"
str3 = "barcode-Kuwait_110_S59_1.fq"
str4 = "bar-Kuwait Kuwait_295_235_622.fg"

# this assumes
#   the first char that matters for ID is always capitalized
#   always 3 characters between the 1st & 2nd underscore & after the 2nd underscore
#   that direction is always a single character

def gimme(s):
  # DIRECTION: the single char before the first period
  # (for a name ending in '_001.fastq.gz' this is the last digit of '001')
  DIRECTION = re.search(r"(.)(?=\.)", s).group(1)
  # ID: a capital letter, then *_3_3, followed by _
  ID = re.search(r"([A-Z].*_.{3}_.{3})(?=_)", s).group(1)
  return (ID, DIRECTION)

s1 = gimme(str1)
s2 = gimme(str2)
s3 = gimme(str3)
s4 = gimme(str4)

print(s1)
# ('Kuwait_110_S59', '1')
print(s2)
# ('Kuwait_110_S59', '1')
print(s3)
# ('Kuwait_110_S59', '1')
print(s4)
# ('Kuwait Kuwait_295_235', '2')

CodePudding user response:

Here is the code:

import re

def repl(match_object):
    inside_bracket = match_object.group(1)
    if inside_bracket == "DIRECTION":
        return r"(?P<DIRECTION>[12])"
    if inside_bracket == "ID":
        return r"(?P<ID>[-.\w]+)"

def function(query: str, naming_scheme: str):
    pattern = re.sub(r"\[(.*?)\]", repl, naming_scheme)
    match = re.match(pattern, query)
    return match["ID"], match["DIRECTION"]

Explanation:

The most important step is converting your template into a regex pattern, i.e.:

[ID]_R[DIRECTION].fastq.gz   -->  (?P<ID>[-.\w]+)_R(?P<DIRECTION>[12]).fastq.gz

This is done with the help of the repl function, which is passed to re.sub. For re.sub I used \[(.*?)\] as the pattern; it matches each pair of brackets and captures their content. When building the replacement, I used your rules for DIRECTION and ID: [DIRECTION] becomes the named group (?P<DIRECTION>[12]), which only accepts 1 or 2, and [ID] becomes (?P<ID>[-.\w]+) for filenames (assuming there is no space in the filename).

That's it. Now you have a pattern with two named groups, ID and DIRECTION. They can be fetched with match["ID"] and match["DIRECTION"].
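
To see what the substitution produces, it can help to print the generated pattern for each template. The sketch below uses a hypothetical helper, scheme_to_pattern, that does the same re.sub step on its own; it assumes the third scheme is written as "barcode-[ID]_[DIRECTION].fq".

```python
import re

# Hypothetical helper: only the template-to-regex step, so the result can be inspected.
def scheme_to_pattern(naming_scheme: str) -> str:
    fields = {
        "ID": r"(?P<ID>[-.\w]+)",
        "DIRECTION": r"(?P<DIRECTION>[12])",
    }
    # When the replacement is a function, its return value is inserted as-is.
    return re.sub(r"\[(.*?)\]", lambda m: fields[m.group(1)], naming_scheme)

for scheme in (
    "[ID]_R[DIRECTION].fastq.gz",
    "[ID]_R[DIRECTION]_001.fastq.gz",
    "barcode-[ID]_[DIRECTION].fq",  # assumed template form of scheme 3
):
    print(scheme, "-->", scheme_to_pattern(scheme))
```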

here is a test:

ID, DIRECTION = function("Kuwait_110_S59_R1.fastq.gz", "[ID]_R[DIRECTION].fastq.gz")
print(ID, DIRECTION)

output:

Kuwait_110_S59 1

Note: I only handled the happy path; don't forget to raise an exception if the query does not match the template.
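
One possible way to cover the unhappy cases is sketched below. It is a variation on the code above, not the original answer: the name parse_filename and the error messages are made up, the literal parts of the template are passed through re.escape so that '.' only matches a dot, and re.fullmatch is used so that the whole filename has to fit the template. It assumes each template contains [ID] and [DIRECTION] exactly once.

```python
import re

FIELD_PATTERNS = {
    "ID": r"(?P<ID>[-.\w]+)",
    "DIRECTION": r"(?P<DIRECTION>[12])",
}

def parse_filename(query: str, naming_scheme: str):
    # re.split with a capturing group keeps the field names:
    # even indices are literal text, odd indices are the names inside [ ].
    parts = re.split(r"\[(.*?)\]", naming_scheme)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)  # literal text, e.g. '.fastq.gz'
        elif part in FIELD_PATTERNS:
            pattern += FIELD_PATTERNS[part]
        else:
            raise ValueError(f"unknown field [{part}] in naming scheme")
    match = re.fullmatch(pattern, query)
    if match is None:
        raise ValueError(f"{query!r} does not follow {naming_scheme!r}")
    return match["ID"], match["DIRECTION"]

print(parse_filename("Kuwait_110_S59_R2_001.fastq.gz", "[ID]_R[DIRECTION]_001.fastq.gz"))
```

With this version a name such as "Kuwait_110_S59_R1xfastq.gz" raises ValueError for the first scheme, whereas an unescaped '.' in the pattern would accept it.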
