I am working on a dataframe that has a category of challenges for a class. I need to be able to identify them as either being 'linux','window' or 'primer' based. I have created a dictionary as so:
import pandas as pd
topic_keywords_dict = {
'Linux':
{
'identification':['linux'],
'topic':
[
'bash','boot','process','auditing'
]},
'Windows':
{
'identification':['windows','memory'],
'topic':
[
'boot','process','artifacts','memory','active_directory','sysinternal'
]},
'Primer':
{
'identification':['primer'],
'topic':
[
'kernel','CLI','registry','process','NTFS','boot','auditing','security','active_directory','networking','surveys'
]}
}
and I have a dataframe that looks like this:
challenge_count_df = pd.DataFrame({'Challenge': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Count' : [32, 22, 40, 12, 10, 60, 32, 22, 44, 90],
'Value' : ["0","5","10","15","5","10","5","10","15","10"],
'Category' : ['linux_bash','primer_02','windows_active_directory','basic_linux','linux_kitty','alpha_primer','windows_auditing','linux_logging', 'linux', 'primer']})
which would give me something like this:
>>> challenge_count_df
Challenge Count Value Category
0 A 32 0 linux_bash
1 B 22 5 primer_02
2 C 40 10 windows_active_directory
3 D 12 15 basic_linux
4 E 10 5 linux_kitty
5 F 60 10 alpha_primer
6 G 32 5 windows_auditing
7 H 22 10 linux_logging
8 I 44 15 linux
9 J 90 10 primer
I was thinking of using something like this:
challenge_count_df[challenge_count_df['Category'].contains('|'.join(topic_keywords_dict[dict_key]['identification']))]
and maybe putting it in a form of applying lambda with the method above
challenge_count_df['key_dict'] = challenge_count_df['Category'].apply(lambda x: key_dict if x .contains('|'.join(topic_keywords_dict[dict_key]['identification'])) for key_dict in topic_keywords_dict)
but I'm thinking I'm doing the for loop inside the lambda wrong...can someone please help me understand what I'm doing wrong?
--------------------EDIT-----------------
The expected outcome would look like this:
>>> challenge_count_df
Challenge Count Value Category key_dict
0 A 32 0 linux_bash linux
1 B 22 5 primer_02 primer
2 C 40 10 windows_active_directory windows
3 D 12 15 basic_linux linux
4 E 10 5 linux_kitty linux
5 F 60 10 alpha_primer primer
6 G 32 5 windows_auditing windows
7 H 22 10 linux_logging linux
8 I 44 15 linux linux
9 J 90 10 primer primer
CodePudding user response:
I would recommend just making a function: it's a little much for a one-liner. Try something like this:
def func(category):
for platform, data in topic_keywords_dict.items():
if any(x in category for x in data['identification']:
return platform
return None
df['key_dict'] = df['Category'].apply(func)
CodePudding user response:
Use Series.str.contains
for identify matched rows and assign key of dictionary:
for k, v in topic_keywords_dict.items():
m = challenge_count_df['Category'].str.contains('|'.join(v['identification']))
challenge_count_df.loc[m, 'key_dict'] = k
print (challenge_count_df)
Challenge Count Value Category key_dict
0 A 32 0 linux_bash Linux
1 B 22 5 primer_02 Primer
2 C 40 10 windows_active_directory Windows
3 D 12 15 basic_linux Linux
4 E 10 5 linux_kitty Linux
5 F 60 10 alpha_primer Primer
6 G 32 5 windows_auditing Windows
7 H 22 10 linux_logging Linux
8 I 44 15 linux Linux
9 J 90 10 primer Primer