Mapping regions (interset) without missing the ones that did not map-CodePudding

I am doing with pandas what is called in bioinformatic interset or overlapping or mapping. This is having 2 data sets, map or land regions between them. See code below.

gene_df = pd.DataFrame({
    "start": [11, 41],
    "end": [19, 49],
    "chrm": [1, 1],
    "strand": ["-", " "]
})
start  end  chrm strand
0     11   19     1      -
1     41   49     1       


df = pd.DataFrame({
    "start": [10, 40, 100],
    "end": [20, 50, 150],
    "chrm": [1, 1, 2],
    "gene_name": ["gene1", "gene2", "gene3"],
    "gene_id": ["g1", "g2", "g3"]
})

start  end  chrm gene_name gene_id
0     10   20     1     gene1      g1
1     40   50     1     gene2      g2
2    100  150     2     gene3      g3


all_results = []  # A temporary container to store our results
for index, row in gene_df.iterrows():
    # Lookup region and store result as DataFrame
    # Note: I didn't understand what you meant by overlap, so this searches for
    # regions in df that overlap regions in gene_df. If you're looking for the opposite,
    # just reverse the <= to >= and >= to <=
    results_df = df.loc[(df["chrm"] == row["chrm"]) & (df["start"] <= row["start"]) & (df["end"] >= row["end"])]
    #  Add coordinates from gene file to the result DataFrame
    results_df["gene_df_start"] = row["start"]
    results_df["gene_df_end"] = row["end"]
    # Store results in our container
    all_results.append(results_df)

# When done with all rows, gather all results into a single DataFrame
finished_df = pd.concat(all_results)

print (finished_df)


start  end  chrm gene_name gene_id  gene_df_start  gene_df_end
0     10   20     1     gene1      g1             11           19
1     40   50     1     gene2      g2             41           49

As you can see, this is basically a filtering process and genes that map remains in finished_df. This does partly what I want, however, I don't want to lose the genes that didn't map I want all of them. I would like to have something like this

start  end  chrm gene_name gene_id     gene_affected      gene_df_start  gene_df_end
0     10   20     1     gene1      g1     yes                 11           19
1     40   50     1     gene2      g2     yes                 41           49
2    100  150     2     gene3      g3     no                  Nan          Nan

I cannot find a simple way to do this.

CodePudding user response：

Add column gene_affected to finished_df, merge with original by all same columns (omitted on parameter) with left join and last repalce missing values in finished_df:

finished_df.insert(len(finished_df.columns) - 2, 'gene_affected', 'yes')
df = df.merge(finished_df, how='left').fillna({'gene_affected': 'no'})

CodePudding user response：

You can replace your loop by merge which is more efficient if your dataframe is large:

out = (
  df.merge(gene_df[['chrm', 'start', 'end']], on='chrm', how='outer', 
           suffixes=('', '_gene_df'), indicator='gene_affected')
    .query("(gene_affected != 'both') | ((start <= start_gene_df) & (end >= end_gene_df))")
    .replace({'gene_affected': {'both': 'yes', 'left_only': 'no'}}).reset_index(drop=True)
)

Output:

>>> out
   start  end  chrm gene_name gene_id  start_gene_df  end_gene_df gene_affected
0     10   20     1     gene1      g1           11.0         19.0           yes
1     40   50     1     gene2      g2           41.0         49.0           yes
2    100  150     2     gene3      g3            NaN          NaN            no