How to create a touchpoint DataFrame from a list in pandas-CodePudding

I working on a Python project that has a DataFrame like this:

row1 = ['BBB', 'AAA', 'CCC']
row2 = ['CCC']
row3 = ['DDD', 'AAA']
row4 = ['DDD', 'BBB', 'AAA', 'EEE', 'CCC']
row5 = ['EEE', 'AAA', 'EEE', 'CCC']
data = {'List': [row1, row2, row3, row4, row5],
        'Path_length':  [3, 1, 2, 5, 4]}

df = pd.DataFrame(data)

which leads to:

|     |             List                   |   Path_length   | 
|  0  | ['BBB', 'AAA', 'CCC']              |       3         |  
|  1  | ['CCC']                            |       1         |   
|  2  | ['DDD','AAA']                      |       2         |   
|  3  | ['DDD','BBB', 'AAA', 'EEE', 'CCC'] |       5         |  
|  4  | ['EEE', 'AAA', 'EEE', 'CCC']       |       4         |

And the task consists of generating the following DataFrame:

|     |   Content       |   Unique    |  Started  |  Middleway  |  Finished  |
|  0  |     AAA         |     0       |     0     |      3      |      1     | 
|  1  |     BBB         |     0       |     1     |      1      |      0     |
|  2  |     CCC         |     1       |     0     |      0      |      3     |
|  3  |     DDD         |     0       |     2     |      0      |      0     |
|  4  |     EEE         |     0       |     1     |      2      |      0     |

where the columns contain the following:

Content: the elements found in the List
Unique: the number of times that the element appears alone in the list
Started: the number of times that the element appears at the beginning
Finished: the number of times that the element appears at the end
Middleway: the number of times that the element appears between the beginning and the end.

I kind of got a solution for this task, but since I used a lot of loop functions, I can't apply the algorithm to a larger database because the time processing is too high. Could you help me by suggesting a code that solves this task?

CodePudding user response：

Here's one approach:

You can mask the paths where the length is 1 (replace values where the condition is True to an empty string "" so that you can explode the lists to count the values in them).

Then for those rows, count the "Started", "Middleway" and "Finished" values using value_counts.

Then, for the rows where the length is 1, count the "Unique" values.

Finally, drop the dummy "" index, sort and rearrange the columns to get the desired outcome:

masked_df = df['List'].mask(df['Path_length']==1, '')
out = masked_df.explode().value_counts().to_frame().rename(columns={'List': 'Finished'}).rename_axis(index=['Content'])
out['Started'] = masked_df.str[0].value_counts()
out['Middleway'] = masked_df.str[1:-1].explode().value_counts()
out['Finished'] -= out[['Started','Middleway']].sum(axis=1)
out['Unique'] = df['List'].where(df['Path_length']==1, '').explode().value_counts()
out = out.drop(index='').fillna(0).astype(int).sort_index().reset_index()[['Content','Unique','Started','Middleway','Finished']]

Output:

  Content  Unique  Started  Middleway  Finished
0     AAA       0        0          3         1
1     BBB       0        1          1         0
2     CCC       1        0          0         3
3     DDD       0        2          0         0
4     EEE       0        1          2         0

CodePudding user response：

Another solution. First we apply enumerate on the List column, explode it and then sum() boolean columns:

df = df.assign(List=df.List.apply(lambda x: [*enumerate(x, 1)])).explode("List")
df[["tmp", "Content"]] = df["List"].apply(pd.Series)

df["Unique"] = df.tmp.eq(1).eq(df.Path_length)
df["Started"] = df.tmp.eq(1) & ~df.Unique
df["Middleway"] = df.tmp.ne(1) & df.tmp.ne(df.Path_length)
df["Finished"] = df.tmp.eq(df.Path_length) & ~df.Unique

print(df.groupby("Content").sum().iloc[:, 2:].reset_index())

Prints:

  Content  Unique  Started  Middleway  Finished
0     AAA       0        0          3         1
1     BBB       0        1          1         0
2     CCC       1        0          0         3
3     DDD       0        2          0         0
4     EEE       0        1          2         0