col_1 | col_2 |
---|---|
struct_1 | AA |
struct_bent_22 | AA |
sound_1 | BB |
sound_type_1 | BB |
I want to represent col_1 as numbers in a way that it retains the variability of the text. Any suggestions on how to do this ? If this is not a good idea, then any suggestions would be helpful.
expected output:
col_1 | col_2 |
---|---|
123 | AA |
1233 | AA |
12345 | BB |
123456 | BB |
Those numbers obviously do not capture what the text means, but I need a solution that captures the features of the text. (If possible)
CodePudding user response:
While hashing always involves the risk of collision, I might look at:
import hashlib
def str_to_int(text):
return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)
print(str_to_int("struct_1"))
print(str_to_int("struct_bent_22"))
and see if that works for you.
it should give you:
67086348026381773031976814245037562760713897031875549849861239415057698757195
24493846661393758293236341956602133508998612831345007873168413426300746343739