Home > database >  How can I label a row with a keyword if another columns value contains a string from a list?
How can I label a row with a keyword if another columns value contains a string from a list?

Time:11-05

I am working on a dataframe that has a category of challenges for a class. I need to be able to identify them as either being 'linux','window' or 'primer' based. I have created a dictionary as so:

import pandas as pd

topic_keywords_dict = {
    'Linux': 
        {
            'identification':['linux'],
            'topic':
            [
                'bash','boot','process','auditing'
            ]},

    
    'Windows':
        {
            'identification':['windows','memory'],
            'topic':
            [
                'boot','process','artifacts','memory','active_directory','sysinternal'
        ]},
    'Primer':
    {
        'identification':['primer'],
        'topic':
        [
            'kernel','CLI','registry','process','NTFS','boot','auditing','security','active_directory','networking','surveys'
    ]}
}

and I have a dataframe that looks like this:

challenge_count_df = pd.DataFrame({'Challenge': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                                  'Count' : [32, 22, 40, 12, 10, 60, 32, 22, 44, 90],
                                  'Value' : ["0","5","10","15","5","10","5","10","15","10"],
                                  'Category' : ['linux_bash','primer_02','windows_active_directory','basic_linux','linux_kitty','alpha_primer','windows_auditing','linux_logging', 'linux', 'primer']})

which would give me something like this:

>>> challenge_count_df
  Challenge  Count Value                  Category
0         A     32     0                linux_bash
1         B     22     5                 primer_02
2         C     40    10  windows_active_directory
3         D     12    15               basic_linux
4         E     10     5               linux_kitty
5         F     60    10              alpha_primer
6         G     32     5          windows_auditing
7         H     22    10             linux_logging
8         I     44    15                     linux
9         J     90    10                    primer

I was thinking of using something like this:

challenge_count_df[challenge_count_df['Category'].contains('|'.join(topic_keywords_dict[dict_key]['identification']))]

and maybe putting it in a form of applying lambda with the method above

challenge_count_df['key_dict'] = challenge_count_df['Category'].apply(lambda x: key_dict if x .contains('|'.join(topic_keywords_dict[dict_key]['identification'])) for key_dict in topic_keywords_dict)

but I'm thinking I'm doing the for loop inside the lambda wrong...can someone please help me understand what I'm doing wrong?

--------------------EDIT-----------------

The expected outcome would look like this:

>>> challenge_count_df
  Challenge  Count Value                  Category   key_dict
0         A     32     0                linux_bash   linux
1         B     22     5                 primer_02   primer
2         C     40    10  windows_active_directory   windows
3         D     12    15               basic_linux   linux
4         E     10     5               linux_kitty   linux
5         F     60    10              alpha_primer   primer
6         G     32     5          windows_auditing   windows
7         H     22    10             linux_logging   linux
8         I     44    15                     linux   linux
9         J     90    10                    primer   primer

CodePudding user response:

I would recommend just making a function: it's a little much for a one-liner. Try something like this:

def func(category):
    for platform, data in topic_keywords_dict.items():
        if any(x in category for x in data['identification']:
             return platform
    return None

df['key_dict'] = df['Category'].apply(func)

CodePudding user response:

Use Series.str.contains for identify matched rows and assign key of dictionary:

for k, v in topic_keywords_dict.items():
    m = challenge_count_df['Category'].str.contains('|'.join(v['identification']))
    challenge_count_df.loc[m, 'key_dict'] = k

print (challenge_count_df)
  Challenge  Count Value                  Category key_dict
0         A     32     0                linux_bash    Linux
1         B     22     5                 primer_02   Primer
2         C     40    10  windows_active_directory  Windows
3         D     12    15               basic_linux    Linux
4         E     10     5               linux_kitty    Linux
5         F     60    10              alpha_primer   Primer
6         G     32     5          windows_auditing  Windows
7         H     22    10             linux_logging    Linux
8         I     44    15                     linux    Linux
9         J     90    10                    primer   Primer
  • Related