For learning purposes, I have a bank statement as a CSV file:
data = pd.read_csv('./datasets/code/analyse/operations.csv')
data.columns = ['identifiant_transaction', 'date_operation', 'date_valeur', 'libelle', 'debit', 'credit', 'solde']
print(data.libelle.head())
which displays like this:
0 FORFAIT COMPTE SUPERBANK XX XX XX XX
1 CARTE XX XX CHEZ LUC XX
2 PRELEVEMENT XX TELEPHONE XX XX
3 CARTE XX XX XX XX XX XX
4 CARTE XX XX XX XX
Name: libelle, dtype: object
My goal is to extract the most common words used in the "libelle" column:
XX 142800
CARTE 24700
VIREMENT 2900
ROBINSON 2000
ANCIENS 2000
I first tried:
from collections import Counter

def most_common_words(labels):
    words = []
    for lab in labels:
        words += lab.split(" ")
    return Counter(words).most_common()
Then:
from itertools import chain

def most_common_words_iter(labels):
    return Counter(chain(*(words.split(" ") for words in labels))).most_common()
And finally:
def most_common_words_pandas(labels):
    return labels.str.split().explode().value_counts(sort=True)
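To check that the three functions actually agree, here is a small self-contained sketch using a hypothetical sample Series shaped like the libelle column (the data itself is made up for illustration):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample in the same shape as the `libelle` column
labels = pd.Series([
    "FORFAIT COMPTE SUPERBANK XX XX XX XX",
    "CARTE XX XX CHEZ LUC XX",
    "PRELEVEMENT XX TELEPHONE XX XX",
])

def most_common_words(labels):
    # Accumulate all words in one list, then count
    words = []
    for lab in labels:
        words += lab.split(" ")
    return Counter(words).most_common()

def most_common_words_iter(labels):
    # Feed Counter a lazy chain of split words, no intermediate list
    return Counter(chain(*(words.split(" ") for words in labels))).most_common()

def most_common_words_pandas(labels):
    # Vectorized split, one word per row after explode, then count
    return labels.str.split().explode().value_counts(sort=True)

# All three return the same counts; pandas returns a Series
# instead of a list of (word, count) tuples
print(most_common_words(labels)[:3])
print(most_common_words_iter(labels)[:3])
print(most_common_words_pandas(labels).head(3))
```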
My hypothesis was that the first solution would be slower because of the intermediate list, and that the second or third solution might benefit from some built-in optimizations (vectorization, better control flow, fewer memory allocations...). But no :-/
Is this expected, or should I be doing it differently?
CodePudding user response:
I got some improvement (30-40%) by modifying the "python" version:
def most_common_words(labels):
    words = ' '.join(labels.values)
    words = words.split(' ')
    return Counter(words).most_common()
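The gain comes from doing a single join and a single split instead of one split per row, which cuts the per-iteration Python overhead. A quick timing sketch with synthetic data (the labels and sizes here are made up; actual numbers will vary by machine and dataset):

```python
import timeit
from collections import Counter

import pandas as pd

# Hypothetical data: repeat a couple of labels to get a Series worth timing
labels = pd.Series([
    "CARTE XX XX CHEZ LUC XX",
    "PRELEVEMENT XX TELEPHONE XX XX",
] * 5000)

def most_common_words_joined(labels):
    # One big join, one big split: a single pass instead of
    # one Python-level split call per row
    words = ' '.join(labels.values)
    return Counter(words.split(' ')).most_common()

t = timeit.timeit(lambda: most_common_words_joined(labels), number=10)
print(f"joined version: {t:.3f}s for 10 runs")
```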