Create multiple columns from single column with complex logic

Time:06-09

I am a bit confused by apply, applymap, and map for DataFrames and Series. I want to create multiple columns derived from one column in a DataFrame, using a function that does some web scraping.

My dataframe looks like this

>>> df
          row1        url    row3
0        data1  http://...    123
1        data2  http://...    325
2        data3  http://...    346

the webscraping function is like this

import requests
from bs4 import BeautifulSoup

def get_stuff_from_url(url: str):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    data1 = soup.find('div', {'class': 'stuff1'})
    data2 = soup.find('span', {'class': 'stuff2'}).text
    data3 = soup.find('p', {'class': 'stuff3'}).text

    return data1, data2, data3

The result should be

>>> df_new
          row1        url    row3       row4       row5       row6
0        data1  http://...    123  newdata1a  newdata2a  newdata3a
1        data2  http://...    325  newdata1b  newdata2b  newdata3b
2        data3  http://...    346  newdata1c  newdata2c  newdata3c

where newdata1 comes from data1 and so on.

My previous attempt (where get_stuff_from_url only returned one value) was

df_new = df_old['url'].apply(lambda row: get_stuff_from_url(row))

but this seems wrong, and I can't extend it to produce multiple output columns. Any ideas on how to solve this the way it is meant to be done?

CodePudding user response:

Problem. We have a df that contains a column with urls. We want to create a soup for each of these urls, then return 3 values from the created soup and populate the rows of 3 new columns with the returned values.

Solution. Here's a simplification of your function:

def get_stuff_from_url(url: str):
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    data1 = '<div ><p>Stuff</p></div>'
    data2 = "Hello world"
    data3 = "Right back at you, sir!"

    return data1, data2, data3

This function returns multiple values. If we assign it to one variable, this variable will now contain a tuple. Suppose we wrote:

df_new = pd.DataFrame(df['url'].apply(lambda row: get_stuff_from_url(row)))

Then we would end up with a df with just 1 column, each row containing the same tuple: ('<div ><p>Stuff</p></div>', 'Hello world', 'Right back at you, sir!').

If we want to populate multiple columns with the elements of these tuples, we can use zip(*iterables): the * operator unpacks the Series of tuples into separate arguments, and zip() regroups them by position, giving one tuple per new column.
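To see what zip(*...) does on its own, here is a minimal sketch in plain Python, without pandas. The list rows and its placeholder strings are made up for illustration; they stand in for the tuples that apply() would return.

```python
# Three rows, each a (data1, data2, data3) tuple, standing in for
# what df['url'].apply(get_stuff_from_url) would produce.
rows = [
    ("div_a", "span_a", "p_a"),
    ("div_b", "span_b", "p_b"),
    ("div_c", "span_c", "p_c"),
]

# *rows passes each row tuple to zip() as a separate argument;
# zip() then groups all first items, all second items, and so on.
col1, col2, col3 = zip(*rows)

print(col1)  # ('div_a', 'div_b', 'div_c')
print(col2)  # ('span_a', 'span_b', 'span_c')
print(col3)  # ('p_a', 'p_b', 'p_c')
```

Each resulting tuple has one entry per row, which is why it can be assigned directly to a DataFrame column.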

To create a new df using this method you could do:

df_new = pd.DataFrame(zip(*df['url'].apply(lambda row: get_stuff_from_url(row)))).T

Result:

                          0            1                        2
0  <div ><p>Stuff</p></div>  Hello world  Right back at you, sir!
1  <div ><p>Stuff</p></div>  Hello world  Right back at you, sir!
2  <div ><p>Stuff</p></div>  Hello world  Right back at you, sir!

If you simply want to add the data to your existing df, you could do:

df['data1'], df['data2'], df['data3'] = zip(*df['url'].apply(lambda row: get_stuff_from_url(row)))

Let's print the first row to see what we end up with (print(df.iloc[0])):

row1                        data1
url                    http://...
row3                          123
data1    <div ><p>Stuff</p></div>
data2                 Hello world
data3     Right back at you, sir!
Name: 0, dtype: object
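As a possible alternative, the tuples can be turned into a named DataFrame in one step and joined back, which leaves the original df untouched. This is only a sketch: the stand-in function body, the example URLs, and the column names row4 to row6 (borrowed from the question's desired output) are assumptions.

```python
import pandas as pd

def get_stuff_from_url(url: str):
    # Stand-in for the real scraper: returns three placeholder strings.
    return "stuff1", "stuff2", "stuff3"

df = pd.DataFrame({
    "row1": ["data1", "data2", "data3"],
    "url": ["http://a", "http://b", "http://c"],  # hypothetical URLs
    "row3": [123, 325, 346],
})

# One tuple per row; tolist() turns the Series into a plain list of tuples.
scraped = df["url"].apply(get_stuff_from_url).tolist()

# Build the new columns at once, reusing df's index so the join lines up.
new_cols = pd.DataFrame(scraped, index=df.index, columns=["row4", "row5", "row6"])
df_new = df.join(new_cols)

print(df_new.columns.tolist())
```

Passing index=df.index matters when df does not have a default 0..n-1 index; without it the join could align the wrong rows.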

CodePudding user response:

You could return a dict from your function, convert it to a pd.Series inside .apply(), and use .join() to attach the result to the original DataFrame:

df.join(df.url.apply(lambda x: pd.Series(get_stuff_from_url(x))))

We call get_stuff_from_url() with the value of the url column for each row, while pd.Series() unpacks the returned dict into the following DataFrame:

    data1   data2   data3
0  stuff1  stuff2  stuff3
1  stuff1  stuff2  stuff3
2  stuff1  stuff2  stuff3

Now a simple df.join() is sufficient to put both DataFrames together and produce the final result:

    row1   url  row3   data1   data2   data3
0  data1  http   123  stuff1  stuff2  stuff3
1  data2  http   325  stuff1  stuff2  stuff3
2  data3  http   346  stuff1  stuff2  stuff3

Example

Just to demonstrate how it works, take your initial function and adapt it to store the scraped data in a dict.

import pandas as pd

df = pd.DataFrame({'row1':['data1','data2','data3'],
                   'url':['http','http','http'],
                   'row3':[123,325,346]
                  })

def get_stuff_from_url(url: str):
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        'data1': 'stuff1', #soup.find('div', {'class': 'stuff1'})
        'data2': 'stuff2', #soup.find('span', {'class': 'stuff2'}).text
        'data3': 'stuff3'  #soup.find('p', {'class': 'stuff3'}).text
    }
    return data

df.join(df.url.apply(lambda x: pd.Series(get_stuff_from_url(x))))
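The reason for returning a dict rather than a tuple shows up in the column labels. The sketch below compares the two; the helper names as_tuple and as_dict are hypothetical stand-ins for the scraping function. It also assigns the result of join(), since join() returns a new DataFrame rather than changing df in place.

```python
import pandas as pd

df = pd.DataFrame({"url": ["http", "http", "http"]})

def as_tuple(url: str):
    # Hypothetical stand-in returning a tuple, as in the question.
    return "stuff1", "stuff2", "stuff3"

def as_dict(url: str):
    # Hypothetical stand-in returning a dict, as in this answer.
    return {"data1": "stuff1", "data2": "stuff2", "data3": "stuff3"}

# pd.Series(tuple) gets positional labels, so the columns are 0, 1, 2.
from_tuple = df["url"].apply(lambda u: pd.Series(as_tuple(u)))

# pd.Series(dict) uses the dict keys as labels, so the columns are named.
from_dict = df["url"].apply(lambda u: pd.Series(as_dict(u)))

# join() returns a new DataFrame; df itself is not modified.
df_new = df.join(from_dict)

print(from_tuple.columns.tolist())
print(df_new.columns.tolist())
```

Building a pd.Series for every row adds some overhead, so for large frames it may be worth comparing this against collecting the results in a list and constructing one DataFrame from it.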