Home > database >  How to match dict keys with dataframe column entries to insert the values
How to match dict keys with dataframe column entries to insert the values

Time:06-28

I used text mining to capture word counts from various documents. In doing so, I search a given list of words in the texts. Example: data, digital & wifi

My code outputs a dictionary with the matched word (key) and count (value) as a result. Example:

txt_file_1: {'data': 2, 'digital': 1}
txt_file_2: {'data': 3}

Now I want to transfer that into a dataframe by listing all the words I searched for in the first column. The other columns should output the txt_file and the corresponding matches as header (see table below).

My question is: How do I match the keys from the dictionary with the related entries in the first column so that the values end up correctly?

words txt_file_1 txt_file_2
data 2 3
digital 1 0
wifi 0 0

CodePudding user response:

A pd.Series is quite similar to a dictionary. It has an index (dictionary keys) and a value for each key.

In this solution, we use these basic facts about Series and pandas automatic index alignment when handling Series and DataFrames. We just produce Series with the right indexes and they combine naturally.

A DataFrame is like a side-by-side collection of Series, one per column.

# given
txt_file_1 = {'data': 2, 'digital': 1}
txt_file_2 = {'data': 3}

# input dictionaries with names
inputs = [("file1", txt_file_1), ("file2", txt_file_2)]

words = ['data', 'digital', 'wifi']

# concatenate together the columns. It merges the indexes automatically
# But then we align the dataframe to the known word list.

df = pd.concat([pd.Series(item, name=name) for name, item in inputs], axis="columns")
df = df.reindex(words).fillna(0).astype('int')
df
         file1  file2
data         2      3
digital      1      0
wifi         0      0

Note that .reindex() expands the DataFrame if it was missing words - but it can also lose rows if you have words in each column that are not in the wordlist. You can use other code to do a union, or check for this error before proceeding.


There is the option of using pd.Series(item, name=name, index=words) already when making the Series. Reindex is then not needed. The only downside I see is that you can't see if there are words other than those in the index in the dictionary, but if you already know this by construction, then it it's no problem.

  • Related