I'm trying to look up the values in a certain column and copy the remaining columns based on that lookup. The catch is that this operation covers more than 20 million rows.
I ran the code, but it still hadn't finished after about 8 hours, so I stopped it. My questions are:
Is my algorithm correct? And if it is correct, is the never-ending run caused by my inefficient algorithm?
Here's my code and tables to illustrate:
Table 1
A | B |
---|---|
12 | abc |
13 | def |
28 | ghi |
50 | jkl |
Table 2 (the lookup is done against this table)
B | C | D |
---|---|---|
abc | 4 | 7 |
def | 3 | 3 |
ghi | 6 | 2 |
jkl | 8 | 1 |
Targeted result
A | B | C | D |
---|---|---|---|
12 | abc | 4 | 7 |
13 | def | 3 | 3 |
28 | ghi | 6 | 2 |
50 | jkl | 8 | 1 |
So columns C and D will also be added to Table 1, looked up from Table 2 via column B.
The values for Table 1 are spread across different CSV files, so I also loop through the files in the folder; in the code I name that list all_files.
My code:
df = pd.DataFrame()
for f in all_files:
    Table1 = pd.read_csv(f)
    for j in range(len(Table1)):
        u = Table1.loc[j, 'B']
        for z in range(len(Table2)):
            if u == Table2.loc[z, 'B']:
                Table1.loc[j, 'C'] = Table2.loc[z, 'C']
                Table1.loc[j, 'D'] = Table2.loc[z, 'D']
                break
    df = pd.concat([df, Table1], axis=0)
I used that break to stop the inner loop once it finds a matching value, and Table1 is then concatenated to df. This code didn't work for me; it loops continuously and never stops.
Can anyone help? Any help would be very much appreciated!
CodePudding user response:
I hope this is the solution you are looking for:
First I would concatenate all the CSV files for table_1 into a single DataFrame. Then I would merge table_2 into table_1 on the key column B. Sample code:
df = pd.DataFrame()
for file in all_files:
    df_tmp = pd.read_csv(file)
    df = pd.concat([df, df_tmp])
df_merge = pd.merge(df, table_2, on="B", how="left")
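For reference, here is a self-contained sketch of that merge on the sample tables from the question (the DataFrame constructors here just stand in for the CSV loading loop):

```python
import pandas as pd

# Sample data matching the tables in the question.
table_1 = pd.DataFrame({'A': [12, 13, 28, 50],
                        'B': ['abc', 'def', 'ghi', 'jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi', 'jkl'],
                        'C': [4, 3, 6, 8],
                        'D': [7, 3, 2, 1]})

# A left merge keeps every row of table_1, even if some value of B
# has no match in table_2 (those rows get NaN in C and D).
df_merge = pd.merge(table_1, table_2, on='B', how='left')
print(df_merge)
```

This produces exactly the targeted result table, and the hash join inside `merge` replaces the nested loops entirely.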
CodePudding user response:
When we use a for-loop
with Pandas, there is a 98% chance that we are doing it wrong. Pandas is designed so that you don't need loops.
Solutions with increasing performance:
import pandas as pd
table_1 = pd.DataFrame({'A': [12, 13, 28, 50], 'B': ['abc', 'def', 'ghi','jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi','jkl'], 'C': [4, 3, 6, 8], 'D': [7, 3, 2, 1]})
# simple merge
table = pd.merge(table_1, table_2, how='inner', on='B')
# gain speed by using indexing
table_1 = table_1.set_index('B')
table_2 = table_2.set_index('B')
table = pd.merge(table_1, table_2, how='inner', left_index=True, right_index=True)
# there is also join, but it's slower than merge
table = table_1.join(table_2, on="B").reset_index()
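When column B is unique in table_2, another option worth sketching is to skip the join machinery entirely and use `Series.map`, which does a plain hash lookup per column (variable names follow the sample data above):

```python
import pandas as pd

table_1 = pd.DataFrame({'A': [12, 13, 28, 50],
                        'B': ['abc', 'def', 'ghi', 'jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi', 'jkl'],
                        'C': [4, 3, 6, 8],
                        'D': [7, 3, 2, 1]})

# Index table_2 by B so each column becomes a B -> value mapping,
# then map column B of table_1 through it.
lookup = table_2.set_index('B')
table_1['C'] = table_1['B'].map(lookup['C'])
table_1['D'] = table_1['B'].map(lookup['D'])
```

Unmatched values of B simply become NaN, matching the behavior of a left merge.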