I am doing with pandas what is called in bioinformatic interset or overlapping or mapping. This is having 2 data sets, map or land regions between them. See code below.
gene_df = pd.DataFrame({
"start": [11, 41],
"end": [19, 49],
"chrm": [1, 1],
"strand": ["-", " "]
})
start end chrm strand
0 11 19 1 -
1 41 49 1
df = pd.DataFrame({
"start": [10, 40, 100],
"end": [20, 50, 150],
"chrm": [1, 1, 2],
"gene_name": ["gene1", "gene2", "gene3"],
"gene_id": ["g1", "g2", "g3"]
})
start end chrm gene_name gene_id
0 10 20 1 gene1 g1
1 40 50 1 gene2 g2
2 100 150 2 gene3 g3
all_results = [] # A temporary container to store our results
for index, row in gene_df.iterrows():
# Lookup region and store result as DataFrame
# Note: I didn't understand what you meant by overlap, so this searches for
# regions in df that overlap regions in gene_df. If you're looking for the opposite,
# just reverse the <= to >= and >= to <=
results_df = df.loc[(df["chrm"] == row["chrm"]) & (df["start"] <= row["start"]) & (df["end"] >= row["end"])]
# Add coordinates from gene file to the result DataFrame
results_df["gene_df_start"] = row["start"]
results_df["gene_df_end"] = row["end"]
# Store results in our container
all_results.append(results_df)
# When done with all rows, gather all results into a single DataFrame
finished_df = pd.concat(all_results)
print (finished_df)
start end chrm gene_name gene_id gene_df_start gene_df_end
0 10 20 1 gene1 g1 11 19
1 40 50 1 gene2 g2 41 49
As you can see, this is basically a filtering process and genes that map remains in finished_df. This does partly what I want, however, I don't want to lose the genes that didn't map I want all of them. I would like to have something like this
start end chrm gene_name gene_id gene_affected gene_df_start gene_df_end
0 10 20 1 gene1 g1 yes 11 19
1 40 50 1 gene2 g2 yes 41 49
2 100 150 2 gene3 g3 no Nan Nan
I cannot find a simple way to do this.
CodePudding user response:
Add column gene_affected
to finished_df
, merge with original by all same columns (omitted on
parameter) with left join and last repalce missing values in finished_df
:
finished_df.insert(len(finished_df.columns) - 2, 'gene_affected', 'yes')
df = df.merge(finished_df, how='left').fillna({'gene_affected': 'no'})
CodePudding user response:
You can replace your loop by merge
which is more efficient if your dataframe is large:
out = (
df.merge(gene_df[['chrm', 'start', 'end']], on='chrm', how='outer',
suffixes=('', '_gene_df'), indicator='gene_affected')
.query("(gene_affected != 'both') | ((start <= start_gene_df) & (end >= end_gene_df))")
.replace({'gene_affected': {'both': 'yes', 'left_only': 'no'}}).reset_index(drop=True)
)
Output:
>>> out
start end chrm gene_name gene_id start_gene_df end_gene_df gene_affected
0 10 20 1 gene1 g1 11.0 19.0 yes
1 40 50 1 gene2 g2 41.0 49.0 yes
2 100 150 2 gene3 g3 NaN NaN no