I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: data is private, but I will see if I can fabricate something and add it into an edit if it helps.
The code has three pandas dataframes: drugUseDF
, tempDF
, which holds the data, and tempDrugUse
, which stores what's been retrieved. I look over every row of tempDF
(there will be several million rows), retrieving the prodcode
identified from each row and then using that to retrieve the corresponding value from use1
column in the drugUseDF
. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
patid eventdate consid prodcode issueseq
0 20001 21/04/2005 2728 85 0
1 25001 21/10/2000 3939 40 0
2 25001 21/02/2001 3950 37 0
drugUseDF
index prodcode ... use1 use2
0 171 479 ... diabetes NaN
1 172 9105 ... diabetes NaN
2 173 5174 ... diabetes NaN
tempDrugUse
use1
0 NaN
1 NaN
2 NaN
This is the code:
dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
#predine dataframe where we will store the results to be the same length as the main data dataframe.
tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=np.str, columns=["use1"])
#go through each row of the main data dataframe.
for ind in range(len(tempDF)):
#retrieve the prodcode from the *ind* row of the main data dataframe
prodcodeStr = tempDF.iloc[ind]["prodcode"]
#get the corresponding value from the use1 column matching the prodcode column
useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
#update the storing dataframe
tempDrugUse.iloc[ind]["use1"] = useStr
print("[DEBUG] End of loop for use1")
dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse
dataframe could mean that the rows are no longer in chronological date order.
CodePudding user response:
When trying to combine data in two dataframes you should use the merge (similar to JOIN in sql-like languages). Performance wise, you should never loop over the rows - you should use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
CodePudding user response:
If I understand you correctly, you want to map the prodcode from both tables. You can do this via pd.merge (please note the example in the code below differs from your data):
tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
'prodcode': [101,102,103]})
drugUseDF = pd.DataFrame({'prodcode': [101,102,103],
'use1': ['diabetes', 'hypertonia', 'gout']})
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')