For learning purposes, I have a bank statement as a CSV file:
data = pd.read_csv('./datasets/code/analyse/operations.csv')
data.columns = ['identifiant_transaction', 'date_operation', 'date_valeur', 'libelle', 'debit', 'credit', 'solde']
print(data.libelle.head())
which displays like this:
0 FORFAIT COMPTE SUPERBANK XX XX XX XX
1 CARTE XX XX CHEZ LUC XX
2 PRELEVEMENT XX TELEPHONE XX XX
3 CARTE XX XX XX XX XX XX
4 CARTE XX XX XX XX
Name: libelle, dtype: object
My goal is to extract the most common words used in the "libelle" column:
XX 142800
CARTE 24700
VIREMENT 2900
ROBINSON 2000
ANCIENS 2000
I first tried:
from collections import Counter

def most_common_words(labels):
    words = []
    for lab in labels:
        words += lab.split(" ")
    return Counter(words).most_common()
Then:
from itertools import chain

def most_common_words_iter(labels):
    return Counter(chain(*(words.split(" ") for words in labels))).most_common()
And finally:
def most_common_words_pandas(labels):
    return labels.str.split().explode().value_counts(sort=True)
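To check that the three functions actually agree, here is a small self-contained sketch using a hypothetical sample Series shaped like the libelle column (the data itself is made up for illustration):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample in the same shape as the `libelle` column
labels = pd.Series([
    "FORFAIT COMPTE SUPERBANK XX XX XX XX",
    "CARTE XX XX CHEZ LUC XX",
    "PRELEVEMENT XX TELEPHONE XX XX",
])

def most_common_words(labels):
    # Accumulate all words in one list, then count
    words = []
    for lab in labels:
        words += lab.split(" ")
    return Counter(words).most_common()

def most_common_words_iter(labels):
    # Feed Counter a lazy chain of split words, no intermediate list
    return Counter(chain(*(words.split(" ") for words in labels))).most_common()

def most_common_words_pandas(labels):
    # Vectorized split, one word per row after explode, then count
    return labels.str.split().explode().value_counts(sort=True)

# All three return the same counts; pandas returns a Series
# instead of a list of (word, count) tuples
print(most_common_words(labels)[:3])
print(most_common_words_iter(labels)[:3])
print(most_common_words_pandas(labels).head(3))
```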
My hypothesis was that the first solution would be slower because of the intermediate list, and that the second or third solution might benefit from some built-in optimizations (vectorization, better control flow, fewer memory allocations...). But no :-/
Is this expected, or should I be doing it differently?
CodePudding user response:
I got some improvement (30-40%) by modifying the "python" version:
def most_common_words(labels):
    words = ' '.join(labels.values)
    words = words.split(' ')
    return Counter(words).most_common()
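The gain comes from doing a single join and a single split instead of one split per row, which cuts the per-iteration Python overhead. A quick timing sketch with synthetic data (the labels and sizes here are made up; actual numbers will vary by machine and dataset):

```python
import timeit
from collections import Counter

import pandas as pd

# Hypothetical data: repeat a couple of labels to get a Series worth timing
labels = pd.Series([
    "CARTE XX XX CHEZ LUC XX",
    "PRELEVEMENT XX TELEPHONE XX XX",
] * 5000)

def most_common_words_joined(labels):
    # One big join, one big split: a single pass instead of
    # one Python-level split call per row
    words = ' '.join(labels.values)
    return Counter(words.split(' ')).most_common()

t = timeit.timeit(lambda: most_common_words_joined(labels), number=10)
print(f"joined version: {t:.3f}s for 10 runs")
```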