how to compare / update with two diffrent Dataframes?-CodePudding

I'm trying to make a code to check the data update, and update it if needs to be.

The problem is the efficiency. I only got a idea of nested loop, and there should be a better way to do this.

there are two DataFrames; df_new, df_old. I want to update df_old's data with newer one. Also, I want to make a ChangeLog if there's a change. (and if not, just a timestamp).

Here's my sample code:

import pandas as pd
from datetime import datetime

df_new = pd.DataFrame({"id":[11,22,33,44,55], "a2":[2,3,8,9,99], "a3":[2,4,2,5,99]})
df_old = pd.DataFrame({"id":[11,22,33,44], "a2":[2,3,4,7], "a3":[2,2,2,7],"CHANGELOG":["","","",""]})


for row in df_new.itertuples():
    flag = 0
    for row2 in df_old.itertuples():
        if row[1] == row2[1]:
            p = str(datetime.now().date())   "\n"
            if row[2] != row2[2]:
                p  =  "a2 : "   str(row[2])   " -> "   str(row2[2])   "\n"
                df_old.at[row2[0],"a2"] = str(row[2])
                
            if row[3] != row2[3]:
                p  =  "a3 : "   str(row[3])   " -> "   str(row2[3])   "\n"
                df_old.at[row2[0],"a3"] = str(row[3])
                
            df_old.at[row2[0],"CHANGELOG"] = p
            flag = 1
            break
        
    if flag == 0:
        
        df_old = df_old.append(pd.DataFrame([row],columns = row._fields),ignore_index=True)
        df_old.at[len(df_old)-1,"CHANGELOG"] = str(datetime.now().date())   "\n"   "Created"

The code actually worked. But only with small datasets. if I run with tens of thousands rows (each), as you've already assumed, it takes too much time.

I've searched that there's pd.compare, but seems like it only works with two dataframes with same rows/columns. And... I'm stuck now.

Are there any functions or references that I can use?

Thank you in advance.

CodePudding user response：

Indeed, as stated in the docs, pd.compare "[c]an only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames. So, let's first achieve this:

import pandas as pd
from datetime import date

df_old = pd.DataFrame({"id":[11,22,33,44], 
                       "a2":[2,3,4,7], 
                       "a3":[2,2,2,7],
                       "CHANGELOG":["","","",""]})

df_new = pd.DataFrame({"id":[11,22,33,44,55], 
                       "a2":[2,3,8,9,99], 
                       "a3":[2,4,2,5,99]})

# get slice of `df_old` with just columns that need comparing
df_slice = df_old.iloc[:,:3]

# get missing indices from `df_new`, build empty df and append to `df_slice`
missing_indices = set(df_new.index).difference(set(df_slice.index))
df_missing = pd.DataFrame(columns = df_slice.columns, index=missing_indices)
df_slice = pd.concat([df_slice,df_missing],axis=0)

print(df_slice)
    id   a2   a3
0   11    2    2
1   22    3    2
2   33    4    2
3   44    7    7
4  NaN  NaN  NaN

Now, we can use pd.compare:

# compare the two dfs, keep_shape=True: same rows remain in result
comp = df_slice.compare(df_new, keep_shape=True)

print(comp)

    id         a2         a3      
  self other self other self other
0  NaN   NaN  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN    2   4.0
2  NaN   NaN    4   8.0  NaN   NaN
3  NaN   NaN    7   9.0    7   5.0
4  NaN  55.0  NaN  99.0  NaN  99.0

Finally, let's apply a custom function to the comp df to generate the strings for the column CHANGELOG. Something like below:

# create func to build changelog strings per row
def create_change_log(row: pd.Series) -> str:
    """
    Parameters
    ----------
    row : pd.Series
        e.g. comp.iloc[1] with ('id','self') etc. as index

    Returns
    -------
    str
        change_log_string per row.
    """
    
    # start string for each row
    string = str(date.today())   "\n"
    
    # get length pd.Series
    length = len(row)
    
    # get range for loop over 'self', so index 0,2,4 if len == 6
    self_iloc = [*range(0,length,2)]
    
    # get level 0 from index to retrieve orig col names: ['id'] etc.
    cols = row.index.get_level_values(0)
    
    # for loop, check scenarios
    for i in self_iloc:
        temp = str()
        
        # both 'self' and 'other' vals are NaN, nothing changed
        if row.isna().all():
            break
        
        # all of 'self' == NaN, entire row is new
        if row.iloc[self_iloc].isna().all():
            temp = 'Created\n'
            string  = temp
            break
        
        # set up comp for specific cols: comp 'self.1' == 'other.1' etc.
        self_val, other_val = row[i], row[i 1]
        
        # add `pd.notna()`, since np.nan == np.nan is actually `False`!
        if self_val != other_val and pd.notna(self_val):
            temp = f'{cols[i]} : {self_val} -> {other_val}\n'
            string  = temp
    return string

Applied to comp:

change_log = comp.apply(lambda row: create_change_log(row), axis=1)
change_log.name = 'CHANGELOG'

# result
df_old_adj = pd.concat([df_new,change_log],axis=1)
print(df_old_adj)

   id  a2  a3                                   CHANGELOG
0  11   2   2                                2022-08-29\n
1  22   3   4                 2022-08-29\na3 : 2 -> 4.0\n
2  33   8   2                 2022-08-29\na2 : 4 -> 8.0\n
3  44   9   5  2022-08-29\na2 : 7 -> 9.0\na3 : 7 -> 5.0\n
4  55  99  99                       2022-08-29\nCreated\n

PS.1: my result has e.g. 2022-08-29\na3 : 2 -> 4.0\n where you generate 2022-08-29\na3 : 4 -> 2\n. The former seems to me correct; you want to convey: value 2 in column a3 has become (->) 4, no? Anyway, you can just switch the vars in {self_val} -> {other_val}, of course.

PS.2: comp is turning ints into floats automatically for other (= df_new). Hence, we end up with 2 -> 4.0 rather than 2 -> 4. I'd say the best solution to 'fix' this depends on the type of values you are expecting.