Map different column values with website context


I have a dataframe like this:

import pandas as pd

df1 = pd.DataFrame({
    "index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else","exec sp_add_jobserver"],
    "index1": ["NaN",  "NaN",  "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "NaN", "something else"],
    "index3": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()

                   index              index1              index2              index3
0     EXEC sp_delete_job                 NaN   EXEC sp_droplogin   EXEC sp_droplogin
1        exec sp_add_job                 NaN  EXEC sp_delete_job  EXEC sp_delete_job
2         something else                 NaN                 NaN     exec sp_add_job
3  exec sp_add_jobserver  exec sp_delete_job      something else  exec sp_delete_job

What I want is to map the column values to their descriptions from this site https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-server-ver15

So for example the value EXEC sp_droplogin can be mapped to the description from here https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-server-ver15

so the output will look like this:

    index
0   Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1   EXEC sp_delete_job
2   exec sp_add_job
3   exec sp_delete_job
4   exec sp_add_jobserver

And the same must be done with the other column values.

What is the best way to perform this? With BeautifulSoup?

Can you provide some ideas/direction/code etc?

CodePudding user response:

You could call a function for each entry and replace it with the result of a requests/BeautifulSoup lookup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

def description(value):
    # Take the procedure name (second token) and swap '_' for '-'
    # to match the URL slug, e.g. 'EXEC sp_droplogin' -> 'sp-droplogin'
    name = value.split(' ')[1].replace('_', '-')
    url = f"https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")
    # The description is the fourth <p> inside the main content div
    div = soup.find('div', class_="content")
    return [p.text for p in div.find_all('p')][3]


df = pd.DataFrame({   
    "index": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"],
})


df['index'] = df['index'].map(description)
print(df)

This would change your dataframe as follows:

                                                                                                  index
0  Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1                                                                                        Deletes a job.
2                                                     Adds a new job executed by the SQL Agent service.
3                                                                                        Deletes a job.
4                                                    Targets the specified job at the specified server.
  1. First take a value, e.g. EXEC sp_droplogin, and split it on the space. Then take the second part, sp_droplogin, and replace any _ with -, as needed for the URL.

  2. Create a suitable URL based on name.

  3. Use requests.get() to obtain the corresponding HTML from the Microsoft site.

  4. Locate a <div class='content'> which holds the description.

  5. Inside that div, locate all the <p> elements and extract the text for each. The fourth entry holds the required text. Return that.
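Note that the lookup above fires one HTTP request per cell, so repeated values such as exec sp_delete_job are fetched again each time they appear. A minimal sketch of a memoised variant, with the URL building split into its own helper (the names proc_to_url and cached_description are my own, not from the question):

```python
from functools import lru_cache

def proc_to_url(value):
    """Build the docs URL for a cell like 'EXEC sp_delete_job':
    the procedure name is the second token, with '_' swapped for '-'."""
    name = value.split(' ')[1].replace('_', '-')
    return ("https://docs.microsoft.com/en-us/sql/relational-databases/"
            f"system-stored-procedures/{name}-transact-sql"
            "?view=sql-server-ver15")

@lru_cache(maxsize=None)
def cached_description(value):
    # Same scrape as description() above, but memoised so each distinct
    # procedure hits the Microsoft site only once, however many times it
    # appears across the dataframe's columns.
    import requests                    # imported here so the URL helper
    from bs4 import BeautifulSoup      # stays usable without these deps
    soup = BeautifulSoup(requests.get(proc_to_url(value)).content, "html.parser")
    return [p.text for p in soup.find('div', class_="content").find_all('p')][3]
```

With caching in place you could map every column in one go, e.g. df1.apply(lambda col: col.map(cached_description)) — though cells that are not procedure calls (NaN, "something else") would still need the guard discussed below.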

If there are None values, you would need to test for this and return a suitable value:

def description(value):
    if value:
        .........existing code......
    else:
        return "Not found"