I'm trying to look up the values in a certain column and copy the remaining columns based on that lookup. The catch is that this operation covers more than 20 million rows.
I ran the code, but it still hadn't finished after about 8 hours, so I stopped it. My questions are:
Is my algorithm correct? And if it is correct, is the never-ending run caused by my inefficient algorithm?
Here's my code and tables to illustrate:
Table 1
A | B |
---|---|
12 | abc |
13 | def |
28 | ghi |
50 | jkl |
Table 2 (the lookup is done against this table)
B | C | D |
---|---|---|
abc | 4 | 7 |
def | 3 | 3 |
ghi | 6 | 2 |
jkl | 8 | 1 |
Targeted result
A | B | C | D |
---|---|---|---|
12 | abc | 4 | 7 |
13 | def | 3 | 3 |
28 | ghi | 6 | 2 |
50 | jkl | 8 | 1 |
So columns C and D will also be added to Table 1, looked up from Table 2 via column B.
The values for Table 1 are spread across different CSV files, so I also loop through the files in the folder; in the code I name that list all_files.
My code:
df = pd.DataFrame()
for f in all_files:
    Table1 = pd.read_csv(f)
    for j in range(len(Table1)):
        u = Table1.loc[j, 'B']
        for z in range(len(Table2)):
            if u == Table2.loc[z, 'B']:
                Table1.loc[j, 'C'] = Table2.loc[z, 'C']
                Table1.loc[j, 'D'] = Table2.loc[z, 'D']
                break
    df = pd.concat([df, Table1], axis=0)
I used that break to stop the inner loop once it finds a matching value, and Table1 is then concatenated to df. This code didn't work for me; it loops continuously and never stops.
Can anyone help? Any help would be very much appreciated!
CodePudding user response:
I hope this is the solution you are looking for:
First I would concatenate all the CSV files for table_1 into a single DataFrame. Then I would merge table_2 into table_1 on the key column B. Sample code:
df = pd.DataFrame()
for file in all_files:
    df_tmp = pd.read_csv(file)
    df = pd.concat([df, df_tmp])
df_merge = pd.merge(df, table_2, on="B", how="left")
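For reference, here is a self-contained sketch of that merge on the sample tables from the question (the DataFrame constructors here just stand in for the CSV loading loop):

```python
import pandas as pd

# Sample data matching the tables in the question.
table_1 = pd.DataFrame({'A': [12, 13, 28, 50],
                        'B': ['abc', 'def', 'ghi', 'jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi', 'jkl'],
                        'C': [4, 3, 6, 8],
                        'D': [7, 3, 2, 1]})

# A left merge keeps every row of table_1, even if some value of B
# has no match in table_2 (those rows get NaN in C and D).
df_merge = pd.merge(table_1, table_2, on='B', how='left')
print(df_merge)
```

This produces exactly the targeted result table, and the hash join inside `merge` replaces the nested loops entirely.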
CodePudding user response:
When we use a for-loop
with Pandas, there is a 98% chance that we are doing it wrong. Pandas is designed so that you don't need loops.
Solutions with increasing performance:
import pandas as pd
table_1 = pd.DataFrame({'A': [12, 13, 28, 50], 'B': ['abc', 'def', 'ghi','jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi','jkl'], 'C': [4, 3, 6, 8], 'D': [7, 3, 2, 1]})
# simple merge
table = pd.merge(table_1, table_2, how='inner', on='B')
# gain speed by using indexing
table_1 = table_1.set_index('B')
table_2 = table_2.set_index('B')
table = pd.merge(table_1, table_2, how='inner', left_index=True, right_index=True)
# there is also join, but it's slower than merge
table = table_1.join(table_2, on="B").reset_index()
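When column B is unique in table_2, another option worth sketching is to skip the join machinery entirely and use `Series.map`, which does a plain hash lookup per column (variable names follow the sample data above):

```python
import pandas as pd

table_1 = pd.DataFrame({'A': [12, 13, 28, 50],
                        'B': ['abc', 'def', 'ghi', 'jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi', 'jkl'],
                        'C': [4, 3, 6, 8],
                        'D': [7, 3, 2, 1]})

# Index table_2 by B so each column becomes a B -> value mapping,
# then map column B of table_1 through it.
lookup = table_2.set_index('B')
table_1['C'] = table_1['B'].map(lookup['C'])
table_1['D'] = table_1['B'].map(lookup['D'])
```

Unmatched values of B simply become NaN, matching the behavior of a left merge.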