Pandas Dataframe explode List, add new columns and count values-CodePudding

I'm a little bit stuck. I habe a Dataframe with a list in a column.

id	list
1	[a, b]
2	[a,a,a,b]
3	c,b,b
4	[c,a]
5	[f,f,b]

I have the values, a, b, c, d, e, f in general. I want to count if two values are in a list togehter and also if a value appears more than once in that list.

I want to get that to create a heatmap, with all values in x and y axis. and the counts where e.g. a is x times in a list with itself or e.g. a and b are x times togehter.

I tried this so far, but it is not exactly the solution i want.

Make ne columns and count values

df['a'] = df['list'].explode().str.contains('a').groupby(level=0).any().astype('int')
df['b'] = df['list'].explode().str.contains('b').groupby(level=0).any().astype('int')
df['c'] = df['list'].explode().str.contains('c').groupby(level=0).any().astype('int')
df['d'] = df['list'].explode().str.contains('d').groupby(level=0).any().astype('int')
df['e'] = df['list'].explode().str.contains('e').groupby(level=0).any().astype('int')
df['f'] = df['list'].explode().str.contains('f').groupby(level=0).any().astype('int')

here i get the first problem, i create a new df with rows names the list and counting the values in the list, but I also get the count if i only have the value once in the list

make x axis

df_explo = df.explode(['list'],ignore_index=True)

get sum of all

df2=df_explo.groupby(['list']).agg({'a':'sum','b':'sum','c':'sum','d':'sum','e':'sum','f':'sum').reset_index()

set index to list

df3 = df2.set_index('list')

create heatmap

sns.heatmap(df3,cmap='RdYlGn_r', linewidths=0.5,annot=True,fmt="d")

CodePudding user response：

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from itertools import combinations

data = [
    ['a', 'b'],
    ['a', 'a', 'a', 'b'],
    ['b', 'b', 'b'],
    ['c', 'a'],
    ['f', 'f', 'b']
]

letters = ['a', 'b', 'c', 'd', 'e', 'f']

duplicate_occurrences = pd.DataFrame(0, index=[0], columns=letters)
co_occurrences = pd.DataFrame(0, index=letters, columns=letters)

for l in data:
    duplicates = [k for k, v in Counter(l).items() if v > 1]
    for d in duplicates:
        duplicate_occurrences[d]  = 1
    co = list(combinations(set(l), 2))
    for a, b in co:
        co_occurrences.loc[a, b]  = 1
        co_occurrences.loc[b, a]  = 1
    

plt.figure(figsize=(7, 1))
sns.heatmap(duplicate_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Duplicate Occurrence Counts')
plt.show()

sns.heatmap(co_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Co-Occurrence Counts')
plt.show()

The first plot shows how often each letter occurs at least twice in a list, the second shows how often each pair of letters occurs together in a list.

In case you want to plot the duplicate occurrences on the diagonal, you could do it e.g. as follows:

df = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    for k, v in Counter(l).items():
        if v > 1:
            df.loc[k, k]  = 1
    for a, b in combinations(set(l), 2):
        df.loc[a, b]  = 1
        df.loc[b, a]  = 1
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")