I'm working with two text files that look like this:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an
...
So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:
GCF_000739415.1
and file 2 this one:
GCA_000739415.1
The only difference is the third character: F versus A. However, sometimes the numbers differ too. Both files contain many IDs like the one above, but the two sets are not identical. My goal is to find the IDs that exist in only one file and not in the other.
I'm working on some Python code:
import re

# PART 1: Open and read text files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()
# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))
# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXISTS IN ONE FILE")
Part 3 should be a for loop that checks each match against both files and prints the patterns that exist in only one file and not in the other. Any ideas for writing this loop?
CodePudding user response:
Perhaps you are after this?
import re
given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"
# GC[A or F]_[one or more digits].[one or more digits]
regex = r"GC[AF]_\d+\.\d+"
matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
This prints:
GCA_000739415.1 is in both files
GCA_000739415.1 is in both files
(twice, because GCA_000739415.1 appears twice in given_example: once as the accession and once inside the FTP path).
But I would recommend:
# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))
Which saves matches as:
['GCA_000739415.1']
Note the regex matches IDs of the form GC[A or F]_[one or more digits].[one or more digits]. Let me know if this is not what you are after.
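Since your goal is actually the IDs that exist in only one file, a set difference may be closer to what you want than the intersection above. A minimal sketch, reusing matches_1 and matches_2 from the code above and assuming the accessions only ever differ in the third letter (F versus A):
# Normalise GCF_ -> GCA_ so accessions from both files are comparable
normalised_1 = {m.replace("GCF_", "GCA_") for m in matches_1}
normalised_2 = {m.replace("GCF_", "GCA_") for m in matches_2}

# Set differences give the IDs unique to each file (order is not preserved)
only_in_1 = normalised_1 - normalised_2  # in file 1 but not in file 2
only_in_2 = normalised_2 - normalised_1  # in file 2 but not in file 1
print(only_in_1)
print(only_in_2)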
CodePudding user response:
Looking at these files, they are tabular data, as described in the link you shared: ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt
The assembly_summary.txt files have 22 tab-delimited columns. Header rows begin with '#'.
A better approach would be to open the files as tab-separated pandas data frames, apply a function that replaces the F in each accession with an A, and then merge the two files, or simply use the isin method to get the indices of the common elements. Here is the code:
Notebook with rationale and algorithm, with comments -
https://colab.research.google.com/drive/1jJYnDpMVCt1spRsUek7RqlMnC_2O92sq?usp=sharing
import pandas as pd

url1 = "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt"
url2 = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt"

df1 = pd.read_csv(url1, sep='\t', low_memory=False)
df2 = pd.read_csv(url2, sep='\t', low_memory=False)

def replaceFs(string_test):
    # Replace the third character (F) with an A so that GCF_ accessions
    # line up with their GCA_ counterparts
    list_words = list(string_test)
    list_words[2] = 'A'
    return_string = ''.join(list_words)
    return return_string

def table_reform(unformed_df):
    # The first line of each file is a comment, so the real column names
    # end up in the first data row; promote that row to the header
    unformed_df = unformed_df.reset_index()
    unformed_df = unformed_df.rename(columns=unformed_df.iloc[0])
    reformed_df = unformed_df[1:]
    return reformed_df

df1 = table_reform(df1)
df2 = table_reform(df2)

# Normalise the RefSeq accessions, then inner-merge on the accession column
df2['# assembly_accession'] = df2['# assembly_accession'].apply(replaceFs)
df_combine = pd.merge(df1, df2, on=['# assembly_accession'], how='inner')
df_combine
Which shows a huge data frame of 254000 rows × 45 columns.
(Embedded gist: https://gist.github.com/paritoshk/7eed427943399237c14b911adfee4428)
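And instead of merging, the isin approach mentioned above can answer the original question directly; a sketch, assuming df1 and df2 have already been reformed and normalised as in the code above:
# Boolean mask: True where a GenBank accession also appears in RefSeq
common = df1['# assembly_accession'].isin(df2['# assembly_accession'])

# Rows whose accession exists only in the GenBank file
only_in_genbank = df1[~common]

# And the reverse: rows whose accession exists only in the RefSeq file
only_in_refseq = df2[~df2['# assembly_accession'].isin(df1['# assembly_accession'])]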
Hope this helps! My personal view from the CS angle is that double loops with regex are much slower on such large datasets than pandas' vectorised operations.