Filter rows based on multiple columns entries-CodePudding

I have a dataframe which contains millions of entries and looks something like this:

Chr	Start	Alt
1	21651521	A
1	41681521	T
1	41681521	T
...	...	...
X	423565	T

I am currently trying to count the number of rows that match several conditions at the same time, i.e. Chr==1, Start==41681521 and Alt==T. Right now I am using this syntax, which works fine, but seems unpythonic and is also rather slow I think.

num_occurrence = sum((df["Chr"] == chrom) & 
                      (df["Start"] == int(position)) & 
                      (df["Alt"] == allele))

Does anyone have an approach which is more suitable then mine? Any help is much appreciated!

Cheers!

CodePudding user response：

Use DataFrame.all Series.sum:

res = (df[["Chr", "Start", "Alt"]] == [chrom, int(position), allele]).all(1).sum()

For example:

import pandas as pd

# toy data
df = pd.DataFrame(data=[[1, 21651521, "A"], [1, 41681521, "T"], [1, 41681521, "T"]], columns=["Chr", "Start", "Alt"])
chrom, position, allele = 1, "21651521", "A"


res = (df[["Chr", "Start", "Alt"]] == [chrom, int(position), allele]).all(1).sum()
print(res)

Output

CodePudding user response：

Alternative 1: `pd.DataFrame.query()`

You could work with query (see also the illustrative examples here):

expr = "Chr=={chr} & Start=={pos} & Alt=='{alt}'"
ret = df.query(expr.format(chr=chrom, pos=int(position), alt=allele))

In my experiments, this led already to a considerable speedup.

Optimizing this further requires additional information about the data types involved. There are several things you could try:

Alternative 2: Query sorted data

If you can afford to sort your DataFrame prior to querying, you can use pd.Series.searchsorted(). Here is a possible approach:

def query_sorted(df, chrom, position, allele):
    """
    Returns index of the matches.
    """
    assert df["Start"].is_monotonic_increasing
    i_min, i_max = df["Start"].searchsorted([position, position 1])
    df = df.iloc[i_min:i_max]
    return df[(df["Chr"] == chrom) & (df["Alt"] == allele)].index

# Usage: first sort df by column "Start", then query:
df = df.sort_values("Start")
ret_index = query_sorted(df, chrom, position, allele)
print(len(ret_index))

Alternative 3: Use hashes

Another idea would be to use hashes. Again, this requires some calculations up front, but it speeds up the query considerably. Here is an example based on pd.util.hash_pandas_object():

def query_hash(df, chrom, position, allele):
    """
    Returns a view on df
    """
    assert "hash" in df
    dummy = pd.DataFrame([[chrom, position, allele]])
    query_hash = pd.util.hash_pandas_object(dummy, index=False).squeeze()
    return df[df["hash"] == query_hash].index

# Usage: first compute hashes over the columns of interest, then query
df["hash"] = pd.util.hash_pandas_object(df[["Chr", "Start", "Alt"]], 
                                        index=False)
ret_index = query_hash(df, chrom, position, allele)
print(len(ret_index))

There are many more ways to improve this, though the alternative approaches will depend on your specific needs.

Results

I tested with n=10M samples on a MacBook Pro (Mid 2015), running Python 3.8, Pandas 1.2.4 and IPython 7.24.1

# original (sum(s)):  1642.0 ms ± 19.1 ms
# original (s.sum()):  639.0 ms ± 21.9 ms
# query():             175.0 ms ±  1.1 ms 
# query_sorted():       17.5 ms ± 60.4 µs
# query-hash():         10.6 ms ± 62.5 µs

Implementation

This is how I constructed the data and compared the different approaches.

import numpy as np 
import pandas as pd

# Create test data
n = int(10*1e6)
df = pd.DataFrame({"Chr": np.random.randint(1,23 1,n),
                   "Start": np.random.randint(100,999, n),
                   "Alt": np.random.choice(list("ACTG"), n)})

# Query point
chrom, position, allele = 1, 142, "A"

# Create test data
n = 10000000
df = pd.DataFrame({"Chr": np.random.randint(1,23 1,n),
                   "Start": np.random.randint(100,999, n),
                   "Alt": np.random.choice(list("ACTG"), n)})

# Query point
chrom, position, allele = 1, 142, "A"

# Measure performance in IPython
print("original (sum(s)):")
%timeit sum((df["Chr"] == chrom) & \
            (df["Start"] == int(position)) & \
            (df["Alt"] == allele))

print("original (s.sum()):")
%timeit ((df["Chr"] == chrom) & \
         (df["Start"] == int(position)) & \
         (df["Alt"] == allele)).sum()

print("query():")
%timeit len(df.query(expr.format(chr=chrom, \
                                 pos=position, \
                                 alt=allele)))

print("query_sorted():")
df_sorted = df.sort_values("Start")
%timeit query_sorted(df_sorted, chrom, position, allele)

print("query-hash():")
df_hash = df.copy()
df_hash["hash"] = pd.util.hash_pandas_object(df_hash[["Chr", "Start", "Alt"]], 
                                             index=False)
%timeit query_hash(df_hash, chrom, position, allele)

Alternative 1: pd.DataFrame.query()

Alternative 2: Query sorted data

Alternative 3: Use hashes

Results

Implementation

Alternative 1: `pd.DataFrame.query()`