I'm trying to add a column to each row of a dataframe which includes a hash value of the row values.
I originally tried this:
df['hash'] = pd.Series((hash(tuple(row)) for _, row in df_to_hash.iterrows()))
However, when I ran this on two different DataFrames, I ran into an issue when the column names didn't match exactly.
For example:
DF1:
Name Age
0 Tom 12
1 Pat 15
DF2:
FirstName Age
0 Tom 12
1 Pat 15
When I hashed the above DataFrames, row 0 in each DataFrame had a different value due to the column names being different.
Is there a way I can hash the row values only, excluding the column names?
I also tried this with no success:
df['hash'] = df_to_hash.apply(lambda x: hash(tuple(x)), axis=1)
CodePudding user response:
What about using the underlying numpy array:
pd.Series((hash(tuple(row)) for row in df_to_hash.to_numpy()))
Output:
0 2606281096150585092
1 -1842928179554038127
dtype: int64
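As a rough sketch of that approach on the two example frames from the question: the helper name row_hashes is made up for illustration, and passing index=df.index is my own addition so that assigning the result back as a column lines up even when the index is not 0..n-1.

```python
import pandas as pd

# The two example frames from the question: same values, different column names.
df1 = pd.DataFrame({"Name": ["Tom", "Pat"], "Age": [12, 15]})
df2 = pd.DataFrame({"FirstName": ["Tom", "Pat"], "Age": [12, 15]})


def row_hashes(df):
    """Hash each row's values only (hypothetical helper, not a pandas API)."""
    # to_numpy() returns just the values, so column labels play no part.
    hashes = [hash(tuple(row)) for row in df.to_numpy()]
    # Reuse the frame's index so the Series aligns when assigned as a column.
    return pd.Series(hashes, index=df.index)


h1 = row_hashes(df1)
h2 = row_hashes(df2)
print((h1 == h2).all())

df1["hash"] = h1
```

Keep in mind that Python's built-in hash() is salted per process for strings unless PYTHONHASHSEED is set, so these values should match within one run but are not expected to match between runs.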
You can also use pandas.util.hash_pandas_object with index=False:
pd.util.hash_pandas_object(df_to_hash, index=False)
Output:
0 17445307237601047733
1 15658167368827391476
dtype: uint64
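To check this against the question's example, here is a minimal sketch that builds both frames and compares the row hashes. The frame contents are taken from the question; the variable names are just for illustration.

```python
import pandas as pd

# Same row values, different column names.
df1 = pd.DataFrame({"Name": ["Tom", "Pat"], "Age": [12, 15]})
df2 = pd.DataFrame({"FirstName": ["Tom", "Pat"], "Age": [12, 15]})

# index=False leaves the index out of the hash; only the cell values are hashed.
h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)

# Compare the hashes row by row.
print((h1 == h2).all())

# The result is a uint64 Series indexed like the frame, so it can be
# assigned directly as a new column.
df1["hash"] = h1
df2["hash"] = h2
```

As far as I know, hash_pandas_object uses a fixed default hash key, so unlike the built-in hash() on strings it should give the same values from one run to the next.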