Home > Software engineering >  Count how many times an object occurs in a list of a list within a DataFrame column
Count how many times an object occurs in a list of a list within a DataFrame column

Time:09-01

Say I have a DataFrame pd with a column called 'elements' which contains a list of a list of objects as shown below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2        [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element B, Element Dy, Element Os]

Name: elements, Length: 1763, dtype: object

I would like to count how many times each string within the whole column, so like in the example above the count for the string 'Element B' is 5 or for 'Element Mo' is 1.

I have tried setting up a dictionary below, but this just counts the each list instead of the strings that are in them.

elements_count_dict = {}
for entry in df2['elements']:
    for object in entry:
        if object in elements_count_dict:
            elements_count_dict[object]  = 1
        else:
            elements_count_dict[object] = 0

However, doing it this way the dictionary of tracks each individual character instead of the strings i.e. [ = 5 and ] = 5 and even after converting the column to string using df2['elements']to_string() it still doesn't work.

CodePudding user response:

Try as follows:

  1. First, use Series.replace to replace [ and ] in your strings with '' (regex: r\[|\]).
  2. Second, use Series.str.split to split the string on , (i.e. ,\s).
  3. Third, use Series.explode to put each item on its own row.
  4. Finally, apply Series.value_counts to return a pd.Series with a count for each item in your lists (in order of appearance).
import pandas as pd

data = {'elements': {0: '[Element B, Element Cr, Element Re]',
  1: '[Element B, Element Rh, Element Sc]',
  2: '[Element B, Element Mo, Element Y]',
  3: '[Element Al, Element B, Element Lu]',
  4: '[Element B, Element Dy, Element Os]'}}

df = pd.DataFrame(data)

counts = df.elements.str.replace(r'\[|\]','', regex=True)\
    .str.split(',\s').explode().value_counts()

print(counts)

Element B     5
Element Cr    1
Element Re    1
Element Rh    1
Element Sc    1
Element Mo    1
Element Y     1
Element Al    1
Element Lu    1
Element Dy    1
Element Os    1
Name: elements, dtype: int64

# to turn that into a dict, simply use:
# d = counts.to_dict()

CodePudding user response:

First use np.ravel to flatten the nested list coming from df.elements.to_list and then use collections.Counter instead of a loop:

import numpy as np
from collections import Counter

ravel = np.ravel(df.elements.to_list())
Counter(ravel)

CodePudding user response:

Here is one way to do it

may not be very elegant, but works

df['col1'].str.replace(r'\[|\]','', regex=True).str.split(',').explode().str.strip().to_frame().groupby('col1').value_counts()
col1
Element Al    1
Element B     5
Element Cr    1
Element Dy    1
Element Lu    1
Element Mo    1
Element Os    1
Element Re    1
Element Rh    1
Element Sc    1
Element Y     1
dtype: int64

Data Used

data={'col1': {0: '[Element B, Element Cr, Element Re]',
  1: '[Element B, Element Rh, Element Sc]',
  2: '[Element B, Element Mo, Element Y]',
  3: '[Element Al, Element B, Element Lu]',
  4: '[Element B, Element Dy, Element Os]'}}
df=pd.DataFrame(data)
df
  • Related