Check if series of numbers is between two columns in pandas dataframe-CodePudding

I am trying to classify genomic locations, and I have DataFrame like the following, with all locations and their respective classification types. The Type column does not have unique classifications, but each row will have a unique combination of Chr, Low, High.

pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})

I then have my sample set that needs to be classified like the DataFrame below.

pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})

To classify the data, for each location in the sample set, the Chromosome position found within the Chr column must match a Chr value found within the reference DataFrame and the Loc value must be >= the Low value and <= the High value. If this happens, that row should then be labeled with the respective Type in the reference DataFrame. In the example I provide, the sample set should be labeled like the following.

pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325],
    'Type':['Foo','None','None']
})

which looks like:

   Chr  Loc  Type
0    1  125   Foo
1    1  325  None
2    5  325  None

CodePudding user response：

You could try this:

df2 = df2.assign(Type=None)

for l in df2["Loc"]:
    i = list(df2["Loc"]).index(l)
    if df1["Low"][i] < l < df1["High"][i]:
        df2["Type"][i] = df1["Type"][i]

Output:

   Chr  Loc  Type
0    1  125   Foo
1    1  325  None
2    5  325  None

CodePudding user response：

You could merge the two on "Chr". Then on the merged DataFrame, see if "Loc" falls between "Low" and "High" and use where to fill "Type" with NaN values if it doesn't. Finally, drop irrelevant columns and duplicate rows:

merged = sample.merge(df, on='Chr', how='left')
merged['Type'] = merged['Type'].where(merged['Loc'].between(merged['Low'], merged['High']))
out = merged.drop(columns=['Low','High']).drop_duplicates(subset=['Chr','Loc'])

Output:

   Chr  Loc Type
0    1  125  Foo
2    1  325  NaN
4    5  325  NaN

CodePudding user response：

You can try apply() to check for each row if the conditions return a Type like this:

# Create your data frames
cat = pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})
test = pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})

# Check if there is type for each row
test['Type'] = test.apply(lambda x: cat[(cat['Chr']==x['Chr']) & (cat['Low'] < x['Loc']) & (cat['High'] > x['Loc'])]['Type'], axis=1)

test

Output:

    Chr Loc Type
0   1   125 Foo
1   1   325 NaN
2   5   325 NaN