Generate DF from attributes of tags in list

Time:12-15

I have a list of revisions from a Wikipedia article that I queried like this:

import urllib.request
import re

def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + wikititle
    revisions = []                                        #list of all accumulated revisions
    next = ''                                             #information for the next request

    while True:
        response = urllib.request.urlopen(url + next).read()     #web request

        response = response.decode("utf-8")               #decode the raw bytes into text

        revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      #break the loop if 'continue' element missing
            break

        next = "&rvcontinue=" + cont.group(1)             #gets the revision Id from which to start the next request
    return revisions

This results in a list in which each element is a rev tag as a string:

['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]

How can I generate a DataFrame from this list?

CodePudding user response:

An "easy" way without using regex would be splitting the string and then parsing:

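import pandas as pd

for rev_string in revisions:
    rev_dict = {}

    # Skip the first and last items, as they are the tag name ("<rev") and the closing "/>".
    # Note: splitting on spaces breaks on attribute values that contain spaces (e.g. comments).
    attributes = rev_string.split(' ')[1:-1]

    # Split each attribute on the first "=" into key and value, and strip the surrounding quotes from the value.
    for attribute in attributes:
        key, value = attribute.split("=", 1)
        rev_dict[key] = value.strip('"')

    # Wrap the dict in a list so pandas builds a one-row DataFrame from the scalar values.
    df = pd.DataFrame([rev_dict])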

This sample creates one DataFrame per revision. If you would like to gather all revisions in a single DataFrame, collect the dictionaries in a list, handle attributes that are not present on every revision (I don't know whether these change depending on the wiki document), and convert the list to a DataFrame once, after all attributes have been gathered.
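Gathering everything into a single DataFrame could look roughly like the sketch below. It swaps the string splitting for the standard library's XML parser, so attribute values containing spaces (such as comments) survive. It assumes every list element is a well-formed, self-closing tag like the one in the question; `revisions_to_df` and the second sample row are made up for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def revisions_to_df(revisions):
    """Build one DataFrame from a list of '<rev ... />' tag strings."""
    # Each string is parsed as a tiny XML document; .attrib holds the
    # attribute names and values of the <rev> element.
    rows = [dict(ET.fromstring(rev).attrib) for rev in revisions]
    # Attributes missing from some revisions (e.g. "minor") become NaN.
    return pd.DataFrame(rows)


# Illustrative input: the first row is the sample from the question,
# the second row is invented to show a missing attribute.
sample = [
    '<rev revid="343143654" parentid="6546465" minor="" user="name" '
    'timestamp="2021-12-12T08:26:38Z" comment="abc" />',
    '<rev revid="343143655" parentid="343143654" user="other name" '
    'timestamp="2021-12-13T09:00:00Z" comment="fixed a typo" />',
]
df = revisions_to_df(sample)
print(df)
```

All values come out as strings, so columns such as `revid` or `timestamp` would still need converting (for example with `pd.to_numeric` or `pd.to_datetime`) if you want numeric or datetime types.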

CodePudding user response:

Use the JSON output format (format=json instead of format=xml); then you can easily create a DataFrame from the JSON.

Example URL for JSON output

For help converting JSON to a DataFrame, check out this Stack Overflow question
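As a rough sketch of the JSON route: the code below flattens an already-parsed response into one row per revision. The nesting it relies on (query, then pages keyed by page id, then revisions) is my assumption about the default JSON layout, so check it against a real response. The `json_revisions_to_df` helper and the sample payload are hypothetical, and the values are made up.

```python
import pandas as pd


def json_revisions_to_df(data):
    """Flatten the revisions of every page in a parsed JSON response."""
    rows = []
    # Assumed layout: data["query"]["pages"][<page id>]["revisions"]
    for page in data.get("query", {}).get("pages", {}).values():
        for rev in page.get("revisions", []):
            # Keep the page title next to each revision's own fields.
            rows.append({"title": page.get("title"), **rev})
    return pd.DataFrame(rows)


# Hypothetical payload shaped like a format=json response.
sample = {
    "continue": {"rvcontinue": "20211212082638|343143654", "continue": "||"},
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Example",
                "revisions": [
                    {"revid": 343143654, "parentid": 6546465, "minor": "",
                     "user": "name", "timestamp": "2021-12-12T08:26:38Z",
                     "comment": "abc"},
                    {"revid": 343143655, "parentid": 343143654,
                     "user": "other name",
                     "timestamp": "2021-12-13T09:00:00Z",
                     "comment": "fixed a typo"},
                ],
            }
        }
    },
}
df = json_revisions_to_df(sample)
print(df)
```

With a live request you would parse the response body with `json.loads` and pass the result to the function. Paging would still follow the `rvcontinue` value, as in the question's loop.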
