extract strings from HTML tag pandas

Time:08-26

How do I extract the following strings from the tags below with Python pandas, using str.extract, a regex, or any other efficient method?

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>

I am using:
.str.extract('(>[A-Za-z])<')

I want this output:
Twitter for iPhone
Twitter Web Client
Vine - Make a Scene
TweetDeck

CodePudding user response:

This might help:

import pandas as pd
lst = [
    ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'],
    ['<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'],
    ['<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'],
    ['<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>']
]

df = pd.DataFrame(lst, columns=['url'])
# Lazily capture the text between the first '>' and the next '<'
df['text'] = df['url'].str.extract(r'>(.*?)<')
print(df)

Output

                                                 url                 text
0  <a href="http://twitter.com/download/iphone" r...   Twitter for iPhone
1  <a href="http://twitter.com" rel="nofollow">Tw...   Twitter Web Client
2  <a href="http://vine.co" rel="nofollow">Vine -...  Vine - Make a Scene
3  <a href="https://about.twitter.com/products/tw...            TweetDeck
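
The pattern from the question, (>[A-Za-z])<, finds nothing because it allows only a single letter between '>' and '<', and it also places the '>' inside the capture group. Below is a minimal sketch of a slightly stricter variant that anchors on the <a> tag itself. It assumes every cell holds exactly one <a> element whose content is plain text; the names s, attempt and anchored are only illustrative. For nested or irregular HTML a real HTML parser is likely a safer choice than a regex.

```python
import pandas as pd

s = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Pattern from the question: exactly one letter between '>' and '<',
# with the '>' inside the capture group. Nothing in the data matches it.
attempt = s.str.extract(r'(>[A-Za-z])<', expand=False)

# Anchored variant: skip the opening <a ...> tag, then capture every
# character up to the next '<', which must begin the closing </a>.
anchored = s.str.extract(r'<a[^>]*>([^<]*)</a>', expand=False)

print(attempt.isna().all())   # True: every row is NaN
print(anchored.tolist())
```

Passing expand=False makes str.extract return a Series rather than a one-column DataFrame when the pattern has a single group, which can be more convenient when assigning to a new column.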