I have a Pandas df with 2 columns:
name Count_Relationship
0 allicin DOWNREGULATE: 1
1 allicin DOWNREGULATE: 2
2 allicin UPREGULATE: 1 | DOWNREGULATE: 1
3 aspirin UPREGULATE: 5 | DOWNREGULATE: 1
4 albuterol DOWNREGULATE: 1
5 albuterol UPREGULATE: 3
I would like to filter out only the rows where if I group by the 'names' and count in the 'Count_Relationship' column the amount of DOWNREGULATE to be more than the amount of the UPREGULATE. In this case allicin would have DOWREGULATE 1 2 1=4 and UPREGULATE =1 so num_downregulate>num_upregulate, while in the rest(aspirin,albuterol) that's not the case. I would like to return this filtered df:
name Count_Relationship
0 allicin DOWNREGULATE: 1
1 allicin DOWNREGULATE: 2
2 allicin UPREGULATE: 1 | DOWNREGULATE: 1
The column Count_Relationship is a string, so I would have to parse the number part of the string and convert it to int.
I tried this:
import pandas as pd
data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2', 'UPREGULATE: 1 | DOWNREGULATE: 1', 'UPREGULATE: 5 | DOWNREGULATE: 1', 'DOWNREGULATE: 1' , 'UPREGULATE: 3']
}
df = pd.DataFrame(data)
substances = df["name"].tolist()
substances = list(set(substances)) # to get the unique names
result_substances = []
for substance in (substances):
try:
numberOfdownregulate = df[(df["name"] == substance) & (\
(df["Count_Relationship"].str.match(pat = '("DOWNREGULATE:"([0-9]))')).values[0].astype(int)
except:
pass
try:
numberOfupregulate = df[(df["name"] == substance) & (\
(df["Count_Relationship"].str.match(pat = '("UPREGULATE:"([0-9]))')).values[0].astype(int)
except:
pass
result = numberOfdownregulate - numberOfupregulate
if result > 0:
result_substances.append(substance)
df_filtered = df[df["name"].isin(result_substances)]
but I get a syntax error at the line numberOfdownregulate where my regex is. How can fix the algorithm? thanks so much
CodePudding user response:
You could extract the infos, compare the up and down, and build a mask to select the data:
drugs = (df.join(df['Count_Relationship'].str.extractall('(?P<down>(?<=DOWNREGULATE: )\d )|(?P<up>(?<=UPREGULATE: )\d )')
.groupby(level=0).first().fillna(0).astype(int)
)
.groupby('name').agg({'down': 'sum', 'up': 'sum'})
.query('down >= up')
.index
)
df[df['name'].isin(drugs)]
output:
name Count_Relationship
0 allicin DOWNREGULATE: 1
1 allicin DOWNREGULATE: 2
2 allicin UPREGULATE: 1 | DOWNREGULATE: 1
CodePudding user response:
I suggest extracting the DOWNREGULATE and UPREGULATE values into different columns, then apply the sum of the values grouped by name and check which is bigger.
The example below creates an additional boolean column named UP_gt_DOWN
, literally UPREGULATE greater than DOWNREGULATE:
df['UPREGULATE'] = df['Count_Relationship'].str.extract(r"UPREGULATE: (\d*)").fillna(0).astype(int)
df['DOWNREGULATE'] = df['Count_Relationship'].str.extract(r"DOWNREGULATE: (\d*)").fillna(0).astype(int)
summed_df = df.groupby('name').sum()
summed_df['UP_gt_DOWN'] = summed_df['UPREGULATE'] > summed_df['DOWNREGULATE']
print(summed_df)
# Output
# UPREGULATE DOWNREGULATE UP_gt_DOWN
# name
# albuterol 3 1 True
# allicin 1 4 False
# aspirin 5 1 True
filtered_drugs = summed_df[~summed_df['UP_gt_DOWN']].index
print(df[df['name'].isin(filtered_drugs)])
# Output
# name Count_Relationship UPREGULATE DOWNREGULATE
# 0 allicin DOWNREGULATE: 1 0 1
# 1 allicin DOWNREGULATE: 2 0 2
# 2 allicin UPREGULATE: 1 | DOWNREGULATE: 1 1 1