I am trying to classify genomic locations, and I have DataFrame like the following, with all locations and their respective classification types. The Type
column does not have unique classifications, but each row will have a unique combination of Chr
, Low
, High
.
pd.DataFrame({
'Chr':[1,1,3],
'Low':[100,200,300],
'High':[150,250,350],
'Type':['Foo','Bar','Foo']
})
I then have my sample set that needs to be classified like the DataFrame below.
pd.DataFrame({
'Chr':[1,1,5],
'Loc':[125,325,325]
})
To classify the data, for each location in the sample set, the Chromosome position found within the Chr
column must match a Chr
value found within the reference DataFrame and the Loc
value must be >=
the Low
value and <=
the High
value. If this happens, that row should then be labeled with the respective Type
in the reference DataFrame. In the example I provide, the sample set should be labeled like the following.
pd.DataFrame({
'Chr':[1,1,5],
'Loc':[125,325,325],
'Type':['Foo','None','None']
})
which looks like:
Chr Loc Type
0 1 125 Foo
1 1 325 None
2 5 325 None
CodePudding user response:
You could try this:
df2 = df2.assign(Type=None)
for l in df2["Loc"]:
i = list(df2["Loc"]).index(l)
if df1["Low"][i] < l < df1["High"][i]:
df2["Type"][i] = df1["Type"][i]
Output:
Chr Loc Type
0 1 125 Foo
1 1 325 None
2 5 325 None
CodePudding user response:
You could merge
the two on "Chr". Then on the merged DataFrame, see if "Loc" falls between "Low" and "High" and use where
to fill "Type" with NaN values if it doesn't. Finally, drop irrelevant columns and duplicate rows:
merged = sample.merge(df, on='Chr', how='left')
merged['Type'] = merged['Type'].where(merged['Loc'].between(merged['Low'], merged['High']))
out = merged.drop(columns=['Low','High']).drop_duplicates(subset=['Chr','Loc'])
Output:
Chr Loc Type
0 1 125 Foo
2 1 325 NaN
4 5 325 NaN
CodePudding user response:
You can try apply()
to check for each row if the conditions return a Type
like this:
# Create your data frames
cat = pd.DataFrame({
'Chr':[1,1,3],
'Low':[100,200,300],
'High':[150,250,350],
'Type':['Foo','Bar','Foo']
})
test = pd.DataFrame({
'Chr':[1,1,5],
'Loc':[125,325,325]
})
# Check if there is type for each row
test['Type'] = test.apply(lambda x: cat[(cat['Chr']==x['Chr']) & (cat['Low'] < x['Loc']) & (cat['High'] > x['Loc'])]['Type'], axis=1)
test
Output:
Chr Loc Type
0 1 125 Foo
1 1 325 NaN
2 5 325 NaN