search for regex match between two files using python


I'm working with two text files that look like this:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an
...

So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:

GCF_000739415.1

and file 2 this one:

GCA_000739415.1

The only difference is the third character: F versus A (although sometimes the numbers differ as well). Both files contain many accessions like this one, but there are some differences between them. My goal is to find the accessions that exist in only one file and not in the other.

I'm working on some Python code:

import re

# PART 1: Open and read the text files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.read()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.read()

# PART 2: Search for IDs with raw-string regexes
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", contents_1)
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", contents_2)

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXISTS IN ONE FILE")

Part 3 should be a for loop that compares the matches from both files and prints the patterns that exist in only one file and not in the other. Any ideas for writing this loop?

CodePudding user response:

Perhaps you are after this?

import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)

# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")

Prints

GCA_000739415.1 is in both files
GCA_000739415.1 is in both files

But I would recommend:

# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))

Which gives:

['GCA_000739415.1']

Note the regex matches accessions of the form GC[A or F]_[one or more digits].[one or more digits]. Let me know if this is not what you are after.
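
Also, since your goal is the accessions that exist in only one file, you could normalize the third character first and then take a symmetric difference instead of an intersection. A rough sketch (the normalize helper here is just for illustration):

def normalize(accession):
    # Replace the third character ('F' or 'A') with 'A' so the GCF_/GCA_
    # versions of the same accession compare as equal keys
    return accession[:2] + "A" + accession[3:]

set_1 = {normalize(m) for m in matches_1}
set_2 = {normalize(m) for m in matches_2}

# Symmetric difference (^): accessions present in exactly one of the sets
only_in_one = set_1 ^ set_2
print(only_in_one)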


CodePudding user response:

Looking at these files, they are tab-delimited tables, as described in the link you shared: ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt

The assembly_summary.txt files have 22 tab-delimited columns, and header rows begin with '#'.

A better approach would be to open the files as tab-separated pandas data frames, apply a function that replaces the F in the third position with an A, and then either merge the two frames or simply use the isin method to get the indices of the common elements. Here is the code:

Notebook with rationale and algorithm, with comments -

https://colab.research.google.com/drive/1jJYnDpMVCt1spRsUek7RqlMnC_2O92sq?usp=sharing

import pandas as pd

url1 = "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt"
url2 = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt"

# The files start with a comment line, so pandas mis-parses the header;
# table_reform() below recovers it from the first data row
df1 = pd.read_csv(url1, sep='\t', low_memory=False)
df2 = pd.read_csv(url2, sep='\t', low_memory=False)

def replaceFs(accession):
    # Replace the third character ('F') with 'A' so GCF_ accessions
    # line up with their GCA_ counterparts
    chars = list(accession)
    chars[2] = 'A'
    return ''.join(chars)

def table_reform(unformed_df):
    # Pull mis-parsed columns out of the index, then promote the first
    # data row (the real column names) to the header
    unformed_df = unformed_df.reset_index()
    unformed_df = unformed_df.rename(columns=unformed_df.iloc[0])
    return unformed_df[1:]

df1 = table_reform(df1)
df2 = table_reform(df2)

# Normalize the RefSeq accessions, then inner-merge on the accession column
df2['# assembly_accession'] = df2['# assembly_accession'].apply(replaceFs)
df_combine = pd.merge(df1, df2, on=['# assembly_accession'], how='inner')
df_combine

Which shows a huge data frame: 254,000 rows × 45 columns.
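
For reference, a minimal sketch of the isin variant mentioned above (the variable names are just illustrative, and it assumes replaceFs has already been applied to df2 as in the code above):

# Boolean mask: True where df1's accession also appears in df2 (normalized)
common_mask = df1['# assembly_accession'].isin(df2['# assembly_accession'])

common_rows = df1[common_mask]    # accessions present in both files
genbank_only = df1[~common_mask]  # accessions present only in the GenBank file
print(len(common_rows), len(genbank_only))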

The full code is also available as a gist: https://gist.github.com/paritoshk/7eed427943399237c14b911adfee4428

Hope this helps! My personal view from the CS angle is that double loops with regex are much slower on datasets this large than vectorized Pandas operations.
