I am a bit confused by all the apply, applymap, and map options for DataFrames and Series. I want to create multiple columns derived from one column in a DataFrame, using a function that does some web scraping.
My dataframe looks like this
>>> df
    row1         url  row3
0  data1  http://...   123
1  data2  http://...   325
2  data3  http://...   346
The web scraping function looks like this:
import requests
from bs4 import BeautifulSoup

def get_stuff_from_url(url: str):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # attrs is a dict mapping an attribute name to the value to match
    data1 = soup.find('div', {'class': 'stuff1'})
    data2 = soup.find('span', {'class': 'stuff2'}).text
    data3 = soup.find('p', {'class': 'stuff3'}).text
    return data1, data2, data3
The result should be
>>> df_new
    row1         url  row3       row4       row5       row6
0  data1  http://...   123  newdata1a  newdata2a  newdata3a
1  data2  http://...   325  newdata1b  newdata2b  newdata3b
2  data3  http://...   346  newdata1c  newdata2c  newdata3c
where newdata1 comes from data1 and so on.
My previous attempt (where get_stuff_from_url only returned one value) was
df_new = df_old['url'].apply(lambda row: get_stuff_from_url(row))
but this seems wrong, and I can't extend it to produce multiple output columns. Any ideas on how to solve this the way it is meant to be done?
CodePudding user response:
Problem. We have a df that contains a column with urls. We want to create a soup for each of these urls, then return 3 values from the created soup and populate the rows of 3 new columns with the returned values.
Solution. Here's a simplification of your function:
def get_stuff_from_url(url: str):
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    data1 = '<div ><p>Stuff</p></div>'
    data2 = "Hello world"
    data3 = "Right back at you, sir!"
    return data1, data2, data3
This function returns multiple values. If we assign the result of a call to a single variable, that variable will hold a tuple. Suppose we wrote:
df_new = pd.DataFrame(df['url'].apply(lambda row: get_stuff_from_url(row)))
Then we would end up with a df with just 1 column, each row containing the same tuple: ('<div ><p>Stuff</p></div>', 'Hello world', 'Right back at you, sir!').
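If we want to populate multiple columns with the elements of these tuples, we can use zip(*iterables). The * operator unpacks the Series of tuples into separate arguments, and zip() then regroups them so that all first elements end up together, all second elements together, and so on.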
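As a minimal illustration of what zip(*...) does here, the sketch below uses three made-up tuples in place of real scraping results (the values are placeholders, not output from any actual URL):

```python
# Three hypothetical per-row results, shaped like the tuples
# that get_stuff_from_url returns.
rows = [
    ("div_a", "span_a", "p_a"),
    ("div_b", "span_b", "p_b"),
    ("div_c", "span_c", "p_c"),
]

# zip(*rows) is equivalent to zip(rows[0], rows[1], rows[2]):
# it groups all first items together, then all second items, and so on.
col1, col2, col3 = zip(*rows)

print(col1)  # ('div_a', 'div_b', 'div_c')
print(col2)  # ('span_a', 'span_b', 'span_c')
print(col3)  # ('p_a', 'p_b', 'p_c')
```

Each resulting tuple has one entry per row, which is the shape needed for a column.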
To create a new df using this method you could do:
df_new = pd.DataFrame(zip(*df['url'].apply(lambda row: get_stuff_from_url(row)))).T
Result:
                          0            1                        2
0  <div ><p>Stuff</p></div>  Hello world  Right back at you, sir!
1  <div ><p>Stuff</p></div>  Hello world  Right back at you, sir!
2  <div ><p>Stuff</p></div>  Hello world  Right back at you, sir!
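A possible alternative, sketched here under assumptions, is to skip the zip/transpose round trip and build the new frame directly from the list of tuples. The stub function, the example URLs, and the column names row4 to row6 (taken from the desired output in the question) are illustrative stand-ins, not real scraping code:

```python
import pandas as pd

def get_stuff_from_url(url: str):
    # Stub standing in for the real scraper; the values are placeholders.
    return f"div for {url}", f"span for {url}", f"p for {url}"

df = pd.DataFrame({
    "row1": ["data1", "data2", "data3"],
    "url": ["http://a", "http://b", "http://c"],  # hypothetical URLs
    "row3": [123, 325, 346],
})

# Series.apply with a tuple-returning function yields a Series of tuples.
results = df["url"].apply(get_stuff_from_url)

# A list of tuples maps naturally onto rows; reusing df.index keeps
# the new frame aligned with the original one.
df_new = pd.DataFrame(
    results.tolist(),
    index=df.index,
    columns=["row4", "row5", "row6"],
)

df_out = df.join(df_new)
print(df_out)
```

Passing index=df.index matters when df does not have a default 0..n-1 index, since join aligns on index labels.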
If you simply want to add the data to your existing df, you could do:
df['data1'], df['data2'], df['data3'] = zip(*df['url'].apply(lambda row: get_stuff_from_url(row)))
Let's print the first row to see what we end up with (print(df.iloc[0])):
row1                        data1
url                    http://...
row3                          123
data1    <div ><p>Stuff</p></div>
data2                 Hello world
data3     Right back at you, sir!
Name: 0, dtype: object
CodePudding user response:
You could return a dict from your function, wrap it in a pd.Series inside .apply(), and use .join() to attach the result to the original DataFrame:
df.join(df.url.apply(lambda x: pd.Series(get_stuff_from_url(x))))
We call get_stuff_from_url() with the value of the url column for each row, and pd.Series() unpacks the returned dict into the following DataFrame:
| | data1 | data2 | data3 |
|---|---|---|---|
| 0 | stuff1 | stuff2 | stuff3 |
| 1 | stuff1 | stuff2 | stuff3 |
| 2 | stuff1 | stuff2 | stuff3 |
Now a simple df.join() is enough to combine both DataFrames into the final result:
| | row1 | url | row3 | data1 | data2 | data3 |
|---|---|---|---|---|---|---|
| 0 | data1 | http | 123 | stuff1 | stuff2 | stuff3 |
| 1 | data2 | http | 325 | stuff1 | stuff2 | stuff3 |
| 2 | data3 | http | 346 | stuff1 | stuff2 | stuff3 |
Example
Just to demonstrate how it works, take your initial function and adapt it to store the scraped data in a dict.
import pandas as pd

df = pd.DataFrame({'row1': ['data1', 'data2', 'data3'],
                   'url': ['http', 'http', 'http'],
                   'row3': [123, 325, 346]
                   })

def get_stuff_from_url(url: str):
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        'data1': 'stuff1',  # soup.find('div', {'class': 'stuff1'})
        'data2': 'stuff2',  # soup.find('span', {'class': 'stuff2'}).text
        'data3': 'stuff3'   # soup.find('p', {'class': 'stuff3'}).text
    }
    return data

df.join(df.url.apply(lambda x: pd.Series(get_stuff_from_url(x))))
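Note that df.join() returns a new DataFrame and leaves df unchanged, so the result has to be assigned to a name to be kept. The sketch below shows that, along with one possible variation that collects the dicts in a list and builds the second frame in one step instead of creating a pd.Series per row, which may be slower on larger frames. The stub function and the names records, scraped, and result are assumptions for illustration:

```python
import pandas as pd

def get_stuff_from_url(url: str) -> dict:
    # Stub standing in for the real scraper; the values are placeholders.
    return {"data1": "stuff1", "data2": "stuff2", "data3": "stuff3"}

df = pd.DataFrame({
    "row1": ["data1", "data2", "data3"],
    "url": ["http", "http", "http"],
    "row3": [123, 325, 346],
})

# One dict per row; the dict keys become the column names.
records = [get_stuff_from_url(u) for u in df["url"]]
scraped = pd.DataFrame(records, index=df.index)

# join() returns a new DataFrame, so keep the result explicitly.
result = df.join(scraped)
print(result)
```

Because the column names come from the dict keys, renaming a key in the function is enough to rename the resulting column.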