I have a CSV file with several columns, one of which is called 'names' and contains many different first names. I want to count how often each name appears in the file and then plot the 10 most common names (in a bar graph or something similar).
CodePudding user response:
One way to achieve this is to build a document-term matrix with the CountVectorizer class from scikit-learn.
First, load your .csv file with the pandas library:
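import pandas as pd
DATASET_ENCODING = 'utf-8'   # placeholder: set this to your file's encoding
DATASET_COLUMNS = ['names']  # placeholder: the column(s) you want to load
df = pd.read_csv('./your_table.csv', encoding=DATASET_ENCODING, usecols=DATASET_COLUMNS)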
After that, use CountVectorizer to create a document-term matrix:
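from sklearn.feature_extraction.text import CountVectorizer
# Note: stop_words='english' drops common English words, which can include words that double as names (e.g. 'will').
cv = CountVectorizer(stop_words='english')
# Replace your_column with the real column name, e.g. df['names'].
your_table_cv = cv.fit_transform(df.your_column)
# One row per CSV row, one column per distinct (lowercased) token.
your_dtm = pd.DataFrame(your_table_cv.toarray(), columns=cv.get_feature_names_out())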
CodePudding user response:
For the DataFrame, I use value_counts with sort=True: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html?highlight=value_count For the plot, I use matplotlib.pyplot.
Here is an example: [image]
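A minimal sketch of the value_counts approach, assuming the column is called 'names' as in the question. The inline CSV text is made-up sample data so the snippet runs on its own; for a real file, pass the path to pd.read_csv instead.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = """id,names
1,Anna
2,Ben
3,Anna
4,Cara
5,Ben
6,Anna
"""
df = pd.read_csv(io.StringIO(csv_text))  # real file: pd.read_csv("your_table.csv")

# value_counts counts each distinct name and sorts by frequency,
# most common first; head(10) keeps the top 10.
top_names = df["names"].value_counts().head(10)

# Draw the counts as a bar chart.
fig, ax = plt.subplots()
top_names.plot.bar(ax=ax)
ax.set_xlabel("Name")
ax.set_ylabel("Count")
ax.set_title("Most common names")
fig.tight_layout()
# Call plt.show() to display the chart, or fig.savefig("top_names.png") to save it.
```

Because the column is a single Series, this calls Series.value_counts rather than the DataFrame method; sort=True is the default, so it does not need to be passed explicitly.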