How to store a list of HTML tags fetched with BeautifulSoup in a dataframe, separated by first letter

I am completely new to Jupyter Notebook, Python, web scraping and so on. I looked at different answers, but no one seems to have the same problem (and I am not good at taking a similar approach and adapting it to my purpose).

I want to create a data grid with all existing HTML tags. As a source I am using the MDN docs. Getting all the tags with Beautiful Soup works fine, but I struggle to go any further with this data.

Here is the code for fetching the data with Beautiful Soup:

from bs4 import BeautifulSoup
import requests

url = "https://developer.mozilla.org/en-US/docs/Web/HTML/Element"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

get_nav_tag = soup.find("nav", class_="sidebar-inner")
get_second_div = get_nav_tag.find_all("div")[2]
get_ol = get_second_div.find("ol")
get_li = get_second_div.find_all("li", class_="toggle")[3]

tag_list = get_li.find_all("code")

print("There are currently", len(tag_list), "tags.")

for tags in tag_list:
    print(tags.text)

The list is already sorted.

Next, I use pandas to create a dataframe:

import pandas as pd

tag_data = []
for tag in tag_list:
    tag_data.append({"Tags": tag.text})

df = pd.DataFrame(tag_data)
df

The output is a dataframe with a single "Tags" column and one tag per row.

QUESTION

How do I create a dataframe with one column per starting letter, where the matching tags are listed under each column?

Like:

  A         B      C 
1 <a>      <b>    <caption>
2 <abbr>   <body> <code>
3 <article> ..    ...
4 ...      ...    ...

How do I split this list into several lists, one for each element's first letter? I guess I will need that for further work as well, such as creating graphs, e.g. a bar chart showing how many tags start with "a", "b", etc.

Thank you!

CodePudding user response：

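The code below should do the job. It takes the character after the opening "<" as each tag's first letter and builds one column per letter.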

df['first_letter'] = df.Tags.str[1]
tag_matrix = pd.DataFrame()

for letter in df.first_letter.unique():
    # Create a pandas series whose name matches the first letter of the tag and contains tags starting with the letter
    matching_tags = pd.Series(df[df.first_letter==letter].reset_index(drop=True).Tags, name=letter)
    # Append the series to the tag_matrix
    tag_matrix = pd.concat([tag_matrix, matching_tags], axis=1)    

tag_matrix

Here's a sample of the output:

Note that you might want to do some additional cleaning, such as dropping duplicate tags or converting to lower case.
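The cleaning mentioned above, and the per-letter count the question asks about for a bar chart, could look roughly like this. It is a minimal sketch: the six-tag sample is hypothetical and only stands in for the scraped list, and the name letter_counts is made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped tags (not the real MDN list)
df = pd.DataFrame({"Tags": ["<a>", "<abbr>", "<B>", "<body>", "<a>", "<code>"]})

# Convert to lower case first so that "<B>" and "<b>" count as the same letter
df["Tags"] = df["Tags"].str.lower()

# Drop duplicate tags and renumber the rows
df = df.drop_duplicates(subset="Tags").reset_index(drop=True)

# The first letter is the character right after the opening "<"
letter_counts = df["Tags"].str[1].value_counts().sort_index()
print(letter_counts)
```

If matplotlib is installed, letter_counts.plot.bar() should draw the bar chart of tags per starting letter.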

CodePudding user response：

You can use the pivot and concat methods to achieve this:

df["letter"] = df["Tags"].str[1].str.upper()
df = df.pivot(columns="letter", values="Tags")
df = pd.concat([df[c].dropna().reset_index(drop=True) for c in df.columns], axis=1)

This gives
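As a minimal sketch, the same pivot and concat steps can be run end to end on a small sample. The five tags below are hypothetical and are not the real MDN list, and the names wide and tag_matrix are only chosen for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped tags
df = pd.DataFrame({"Tags": ["<a>", "<abbr>", "<b>", "<body>", "<code>"]})

# Upper-case first letter of each tag (the character after "<")
df["letter"] = df["Tags"].str[1].str.upper()

# One column per letter; each tag stays on its original row, NaN elsewhere
wide = df.pivot(columns="letter", values="Tags")

# Drop the NaN gaps in every column and line the columns up from row 0
tag_matrix = pd.concat(
    [wide[c].dropna().reset_index(drop=True) for c in wide.columns],
    axis=1,
)
print(tag_matrix)
```

Columns with fewer tags than the longest one are padded with NaN at the bottom, so the result stays rectangular.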