I'm trying to write code that checks whether data has changed and updates it if necessary.
The problem is efficiency. The only idea I have is a nested loop, and there should be a better way to do this.
There are two DataFrames, df_new and df_old. I want to update df_old's data with the newer data. I also want to write a CHANGELOG entry if there's a change (and if not, just a timestamp).
Here's my sample code:
import pandas as pd
from datetime import datetime
df_new = pd.DataFrame({"id":[11,22,33,44,55], "a2":[2,3,8,9,99], "a3":[2,4,2,5,99]})
df_old = pd.DataFrame({"id":[11,22,33,44], "a2":[2,3,4,7], "a3":[2,2,2,7],"CHANGELOG":["","","",""]})
for row in df_new.itertuples():
    flag = 0
    for row2 in df_old.itertuples():
        if row[1] == row2[1]:
            p = str(datetime.now().date()) + "\n"
            if row[2] != row2[2]:
                p += "a2 : " + str(row[2]) + " -> " + str(row2[2]) + "\n"
                df_old.at[row2[0],"a2"] = str(row[2])
            if row[3] != row2[3]:
                p += "a3 : " + str(row[3]) + " -> " + str(row2[3]) + "\n"
                df_old.at[row2[0],"a3"] = str(row[3])
            df_old.at[row2[0],"CHANGELOG"] = p
            flag = 1
            break
    if flag == 0:
        df_old = df_old.append(pd.DataFrame([row],columns = row._fields),ignore_index=True)
        df_old.at[len(df_old)-1,"CHANGELOG"] = str(datetime.now().date()) + "\n" + "Created"
The code actually works, but only with small datasets. If I run it with tens of thousands of rows (in each DataFrame), as you've probably guessed, it takes too much time.
I found pd.compare, but it seems to work only with two DataFrames that have the same rows and columns. And... I'm stuck now.
Are there any functions or references that I can use?
Thank you in advance.
CodePudding user response:
Indeed, as stated in the docs, pd.compare "[c]an only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames." So, let's first achieve this:
import pandas as pd
from datetime import date
df_old = pd.DataFrame({"id":[11,22,33,44],
                       "a2":[2,3,4,7],
                       "a3":[2,2,2,7],
                       "CHANGELOG":["","","",""]})
df_new = pd.DataFrame({"id":[11,22,33,44,55],
                       "a2":[2,3,8,9,99],
                       "a3":[2,4,2,5,99]})
# get slice of `df_old` with just columns that need comparing
df_slice = df_old.iloc[:,:3]
# get missing indices from `df_new`, build empty df and append to `df_slice`
missing_indices = df_new.index.difference(df_slice.index)
df_missing = pd.DataFrame(columns = df_slice.columns, index=missing_indices)
df_slice = pd.concat([df_slice,df_missing],axis=0)
print(df_slice)
    id   a2   a3
0   11    2    2
1   22    3    2
2   33    4    2
3   44    7    7
4  NaN  NaN  NaN
Now, we can use pd.compare:
# compare the two dfs, keep_shape=True: same rows remain in result
comp = df_slice.compare(df_new, keep_shape=True)
print(comp)
    id         a2         a3
  self other self other self other
0  NaN   NaN  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN    2   4.0
2  NaN   NaN    4   8.0  NaN   NaN
3  NaN   NaN    7   9.0    7   5.0
4  NaN  55.0  NaN  99.0  NaN  99.0
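Note that the comparison above matches rows by positional index, which works here because both frames list the ids in the same order. If that is not guaranteed in your real data, one option is to key both frames on id before comparing. Below is a minimal sketch of that idea, assuming id is unique in each frame; the names old_by_id, new_by_id, all_ids, old_aligned and new_aligned are mine and purely illustrative:

```python
import pandas as pd

df_old = pd.DataFrame({"id": [11, 22, 33, 44],
                       "a2": [2, 3, 4, 7],
                       "a3": [2, 2, 2, 7],
                       "CHANGELOG": ["", "", "", ""]})
df_new = pd.DataFrame({"id": [11, 22, 33, 44, 55],
                       "a2": [2, 3, 8, 9, 99],
                       "a3": [2, 4, 2, 5, 99]})

# use `id` as the row label (assumption: ids are unique per frame)
old_by_id = df_old.set_index("id")[["a2", "a3"]]
new_by_id = df_new.set_index("id")

# reindex both frames onto the union of ids so the labels are identical;
# ids missing from one side become all-NaN rows on that side
all_ids = old_by_id.index.union(new_by_id.index)
old_aligned = old_by_id.reindex(all_ids)
new_aligned = new_by_id.reindex(all_ids)

# same call as above, but rows are now matched by `id`
comp = old_aligned.compare(new_aligned, keep_shape=True)
print(comp)
```

With this layout, comp is indexed by id, and the id column itself no longer appears among the compared columns.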
Finally, let's apply a custom function to the comp df to generate the strings for the column CHANGELOG. Something like below:
# create func to build changelog strings per row
def create_change_log(row: pd.Series) -> str:
    """
    Parameters
    ----------
    row : pd.Series
        e.g. comp.iloc[1] with ('id','self') etc. as index
    Returns
    -------
    str
        change_log_string per row.
    """
    # start string for each row
    string = str(date.today()) + "\n"
    # get length pd.Series
    length = len(row)
    # get range for loop over 'self', so index 0,2,4 if len == 6
    self_iloc = [*range(0,length,2)]
    # get level 0 from index to retrieve orig col names: ['id'] etc.
    cols = row.index.get_level_values(0)
    # for loop, check scenarios
    for i in self_iloc:
        temp = str()
        # both 'self' and 'other' vals are NaN, nothing changed
        if row.isna().all():
            break
        # all of 'self' == NaN, entire row is new
        if row.iloc[self_iloc].isna().all():
            temp = 'Created\n'
            string += temp
            break
        # set up comp for specific cols: comp 'self.1' == 'other.1' etc.
        # (use .iloc: positional access via row[i] is deprecated on labeled Series)
        self_val, other_val = row.iloc[i], row.iloc[i+1]
        # add `pd.notna()`, since np.nan == np.nan is actually `False`!
        if self_val != other_val and pd.notna(self_val):
            temp = f'{cols[i]} : {self_val} -> {other_val}\n'
            string += temp
    return string
Applied to comp:
change_log = comp.apply(lambda row: create_change_log(row), axis=1)
change_log.name = 'CHANGELOG'
# result
df_old_adj = pd.concat([df_new,change_log],axis=1)
print(df_old_adj)
   id  a2  a3                                   CHANGELOG
0  11   2   2                                2022-08-29\n
1  22   3   4                 2022-08-29\na3 : 2 -> 4.0\n
2  33   8   2                 2022-08-29\na2 : 4 -> 8.0\n
3  44   9   5  2022-08-29\na2 : 7 -> 9.0\na3 : 7 -> 5.0\n
4  55  99  99                       2022-08-29\nCreated\n
PS.1: my result has e.g. 2022-08-29\na3 : 2 -> 4.0\n where you generate 2022-08-29\na3 : 4 -> 2\n. The former seems correct to me; you want to convey: value 2 in column a3 has become (->) 4, no? Anyway, you can just switch the vars in {self_val} -> {other_val}, of course.
PS.2: comp is turning ints into floats automatically for other (= df_new). Hence, we end up with 2 -> 4.0 rather than 2 -> 4. I'd say the best solution to 'fix' this depends on the type of values you are expecting.
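If the values are always integers, one way around the float conversion is to skip compare altogether and build the strings column by column after a merge on id. The following is only a sketch under some assumptions: id is unique in each frame, the value columns are listed by hand in value_cols, and (as in the result above) rows that exist only in df_old and any earlier CHANGELOG text are not carried over. The names value_cols, stamp, merged, is_new, log and result are illustrative, not part of any API:

```python
import pandas as pd
from datetime import date

df_old = pd.DataFrame({"id": [11, 22, 33, 44],
                       "a2": [2, 3, 4, 7],
                       "a3": [2, 2, 2, 7],
                       "CHANGELOG": ["", "", "", ""]})
df_new = pd.DataFrame({"id": [11, 22, 33, 44, 55],
                       "a2": [2, 3, 8, 9, 99],
                       "a3": [2, 4, 2, 5, 99]})

value_cols = ["a2", "a3"]           # columns to track (assumption)
stamp = str(date.today()) + "\n"    # every row starts with the date

# left merge keyed on `id`: old values land in `a2_old`, `a3_old`;
# `indicator=True` adds a `_merge` column telling where each row came from
merged = df_new.merge(df_old[["id"] + value_cols], on="id", how="left",
                      suffixes=("", "_old"), indicator=True)
is_new = merged["_merge"].eq("left_only")   # ids absent from df_old

log = pd.Series(stamp, index=merged.index)
for col in value_cols:
    old = merged[col + "_old"]
    changed = ~is_new & old.ne(merged[col])
    # cast old values back to the new column's dtype so 4.0 prints as 4
    old_txt = old[changed].astype(merged[col].dtype).astype(str)
    new_txt = merged.loc[changed, col].astype(str)
    log.loc[changed] += col + " : " + old_txt + " -> " + new_txt + "\n"
log.loc[is_new] += "Created\n"

result = merged[["id"] + value_cols].assign(CHANGELOG=log)
print(result)
```

This loops over columns rather than rows, so the per-row work is done by pandas itself; with tens of thousands of rows that should be noticeably cheaper than a row-wise apply, though I have not benchmarked it.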