How can I extract the link text from the tags below using str.extract, a regex, or any other efficient approach in Python pandas?
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
I am using:
.str.extract('(>[A-Za-z])<')
I want this output:
Twitter for iPhone
Twitter Web Client
Vine - Make a Scene
TweetDeck
CodePudding user response:
This might help:
import pandas as pd
lst = [
['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'],
['<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'],
['<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'],
['<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>']
]
df = pd.DataFrame(lst, columns=['url'])
df['text'] = df['url'].str.extract(r'>(.*?)<')
print(df)
Output:
                                                 url                 text
0  <a href="http://twitter.com/download/iphone" r...   Twitter for iPhone
1  <a href="http://twitter.com" rel="nofollow">Tw...   Twitter Web Client
2  <a href="http://vine.co" rel="nofollow">Vine -...  Vine - Make a Scene
3  <a href="https://about.twitter.com/products/tw...            TweetDeck
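For comparison, here is a minimal sketch of why the pattern from the question finds nothing, next to a slightly stricter variant of the pattern used above. The names `tags`, `attempt`, and `text` are made up for this illustration, and `[^<]+` is just one possible alternative to `.*?`, not the only correct choice.

```python
import pandas as pd

tags = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Pattern from the question: the '>' sits inside the capture group, and
# [A-Za-z] matches exactly one letter that must be followed directly by '<'.
# None of the sample strings contain that sequence, so every row is NaN.
attempt = tags.str.extract(r'(>[A-Za-z])<', expand=False)

# Stricter variant: capture one or more characters that are not '<'.
# expand=False returns a Series instead of a one-column DataFrame.
text = tags.str.extract(r'>([^<]+)<', expand=False)

print(attempt.isna().all())
print(text.tolist())
```

Unlike `.*?`, the `[^<]+` group cannot match an empty string, so it would skip an empty gap between two adjacent tags. For HTML that is more irregular than these samples, a real HTML parser is likely a safer choice than a regex.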