I working on a Python project that has a DataFrame like this:
row1 = ['BBB', 'AAA', 'CCC']
row2 = ['CCC']
row3 = ['DDD', 'AAA']
row4 = ['DDD', 'BBB', 'AAA', 'EEE', 'CCC']
row5 = ['EEE', 'AAA', 'EEE', 'CCC']
data = {'List': [row1, row2, row3, row4, row5],
'Path_length': [3, 1, 2, 5, 4]}
df = pd.DataFrame(data)
which leads to:
| | List | Path_length |
| 0 | ['BBB', 'AAA', 'CCC'] | 3 |
| 1 | ['CCC'] | 1 |
| 2 | ['DDD','AAA'] | 2 |
| 3 | ['DDD','BBB', 'AAA', 'EEE', 'CCC'] | 5 |
| 4 | ['EEE', 'AAA', 'EEE', 'CCC'] | 4 |
And the task consists of generating the following DataFrame:
| | Content | Unique | Started | Middleway | Finished |
| 0 | AAA | 0 | 0 | 3 | 1 |
| 1 | BBB | 0 | 1 | 1 | 0 |
| 2 | CCC | 1 | 0 | 0 | 3 |
| 3 | DDD | 0 | 2 | 0 | 0 |
| 4 | EEE | 0 | 1 | 2 | 0 |
where the columns contain the following:
- Content: the elements found in the List
- Unique: the number of times that the element appears alone in the list
- Started: the number of times that the element appears at the beginning
- Finished: the number of times that the element appears at the end
- Middleway: the number of times that the element appears between the beginning and the end.
I kind of got a solution for this task, but since I used a lot of loop functions, I can't apply the algorithm to a larger database because the time processing is too high. Could you help me by suggesting a code that solves this task?
CodePudding user response:
Here's one approach:
You can mask
the paths where the length is 1 (replace values where the condition is True to an empty string ""
so that you can explode
the lists to count the values in them).
Then for those rows, count the "Started", "Middleway" and "Finished" values using value_counts
.
Then, for the rows where the length is 1, count the "Unique" values.
Finally, drop the dummy ""
index, sort
and rearrange the columns to get the desired outcome:
masked_df = df['List'].mask(df['Path_length']==1, '')
out = masked_df.explode().value_counts().to_frame().rename(columns={'List': 'Finished'}).rename_axis(index=['Content'])
out['Started'] = masked_df.str[0].value_counts()
out['Middleway'] = masked_df.str[1:-1].explode().value_counts()
out['Finished'] -= out[['Started','Middleway']].sum(axis=1)
out['Unique'] = df['List'].where(df['Path_length']==1, '').explode().value_counts()
out = out.drop(index='').fillna(0).astype(int).sort_index().reset_index()[['Content','Unique','Started','Middleway','Finished']]
Output:
Content Unique Started Middleway Finished
0 AAA 0 0 3 1
1 BBB 0 1 1 0
2 CCC 1 0 0 3
3 DDD 0 2 0 0
4 EEE 0 1 2 0
CodePudding user response:
Another solution. First we apply enumerate on the List column, explode it and then sum() boolean columns:
df = df.assign(List=df.List.apply(lambda x: [*enumerate(x, 1)])).explode("List")
df[["tmp", "Content"]] = df["List"].apply(pd.Series)
df["Unique"] = df.tmp.eq(1).eq(df.Path_length)
df["Started"] = df.tmp.eq(1) & ~df.Unique
df["Middleway"] = df.tmp.ne(1) & df.tmp.ne(df.Path_length)
df["Finished"] = df.tmp.eq(df.Path_length) & ~df.Unique
print(df.groupby("Content").sum().iloc[:, 2:].reset_index())
Prints:
Content Unique Started Middleway Finished
0 AAA 0 0 3 1
1 BBB 0 1 1 0
2 CCC 1 0 0 3
3 DDD 0 2 0 0
4 EEE 0 1 2 0