I have a dataframe like this:
df1 = pd.DataFrame({
"index": ["EXEC sp_delete_job", "exec sp_add_job", "something else","exec sp_add_jobserver"],
"index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"],
"index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
"index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()
index index1 index2 index3
0 EXEC sp_delete_job NaN EXEC sp_droplogin EXEC sp_droplogin
1 exec sp_add_job NaN EXEC sp_delete_job EXEC sp_delete_job
2 something else NaN NaN exec sp_add_job
3 exec sp_add_jobserver exec sp_delete_job something else exec sp_delete_job
What I want is to map the columns values with their description from this site https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-server-ver15
So for example this value EXEC sp_droplogin
can be mapped with the description from here
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-server-ver15
so the output will look like this:
index
0 Removes a SQL Server login. This prevents access to an instance of SQL Server under
that login name.
1 EXEC sp_delete_job
2 exec sp_add_job
3 exec sp_delete_job
4 exec sp_add_jobserver
And the same must be done with the other column values.
What is the best way to perform this? With BeautifulSoup?
Can you provide some ideas/direction/code etc?
CodePudding user response:
You could call a function for each index
entry and replace it with the results of a requests
beautifulsoup
lookup:
import pandas as pd
import requests
from bs4 import BeautifulSoup
def description(value):
name = value.split(' ')[1].replace('_', '-')
url = f"https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
div = soup.find('div', class_="content")
return [p.text for p in div.find_all('p')][3]
df = pd.DataFrame({
"index": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"],
})
df['index'] = df['index'].map(description)
print(df)
This would change your dataframe as follows:
index
0 Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1 Deletes a job.
2 Adds a new job executed by the SQL Agent service.
3 Deletes a job.
4 Targets the specified job at the specified server.
First take
value
e.g.EXEC sp_droplogin
and split it on the space. Then take the second partsp_droplogin
and replace any_
with-
which is needed for the URL.Create a suitable URL based on
name
.Use
requests.get()
to obtain the corresponding HTML from the Microsoft site.Locate a
<div class='content'>
which holds the description.Inside that div, locate all the
<p>
elements and extract the text for each. The fourth entry holds the required text. Return that.
If there are None
values, you would need to test for this and return a suitable value:
def description(value):
if value:
.........existing code......
else:
return "Not found"