How to fill an empty pandas DataFrame with zero columns cell by cell in a loop?

I need to scrape hundreds of pages, and instead of storing the whole JSON of each page, I want to store just a few fields from each page in a pandas DataFrame. The problem is at the start, when the DataFrame is empty: I need to fill a DataFrame that has no columns or rows yet. The loop below does not work correctly:

import pandas as pd
import requests


cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame()

for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...

In this case, what should I do?

CodePudding user response:

You can solve this by using enumerate() together with loc. Assigning through loc to a row label that does not exist yet adds that row:

for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
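
To see why the original loop leaves the DataFrame empty and why loc fixes it, here is a minimal sketch that runs without any network access. The fake_pages dict and the names in it are hypothetical stand-ins for the JSON each request would return; they are not part of the original question.

```python
import pandas as pd

# Hypothetical stand-in for the parsed JSON of each page (no network needed).
fake_pages = {
    4100: {'user': {'profile': {'Name': 'Alice'}}},
    4101: {'user': {'profile': {'Name': 'Bob'}}},
}

# The original pattern: a scalar assigned to a column of an empty frame
# is broadcast over zero rows, so the column appears but no row is created.
empty = pd.DataFrame()
empty['Customer_id'] = 4100
print(len(empty))

# The loc pattern: assigning to a missing row label enlarges the frame.
df = pd.DataFrame()
for index, (cid, jdata) in enumerate(fake_pages.items()):
    df.loc[index, 'Customer_id'] = cid
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
print(df)
```

Columns built this way may not end up with the dtype you expect (integers can come back as floats, for example), so it is worth checking df.dtypes afterwards.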

CodePudding user response:

If you specify the column names when you create the empty DataFrame:

df = pd.DataFrame(columns = ['Customer_id', 'Name'])

then you can add a row on each iteration of your for loop (plus any other columns you populate):

df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)

Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this works only on older versions. The full loop:

import pandas as pd
import requests


cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame(columns = ['Customer_id', 'Name'])

for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
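
On pandas 2.0 and later, where DataFrame.append is no longer available, roughly the same row-by-row pattern can be written with pd.concat. This is only a sketch: fake_pages is a hypothetical stand-in for the JSON responses, and recent pandas versions may emit a FutureWarning when one of the concatenated frames is empty.

```python
import pandas as pd

# Hypothetical stand-in for the parsed JSON of each page (no network needed).
fake_pages = {
    4100: {'user': {'profile': {'Name': 'Alice'}}},
    4101: {'user': {'profile': {'Name': 'Bob'}}},
}

df = pd.DataFrame(columns=['Customer_id', 'Name'])

for i, jdata in fake_pages.items():
    # Wrap the new row in a one-row DataFrame, then concatenate it.
    row = pd.DataFrame([{'Customer_id': i,
                         'Name': jdata['user']['profile']['Name']}])
    df = pd.concat([df, row], ignore_index=True)

print(df)
```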

Note that calling append on a DataFrame in a loop is usually inefficient, because every call returns a new DataFrame and copies all the rows collected so far. A better way is to collect your results in a list of lists (df_data) and turn that into a DataFrame once at the end, as below:

cids = [4100,4101,4102,4103,4104]
df_data = []

for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])
    
df = pd.DataFrame(df_data, columns = ['Customer_id', 'Name'])
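
A small variation, in case you populate many columns: collect a list of dicts instead of a list of lists, so each value stays next to its column name and you do not need to keep a separate columns list in the same order. A minimal sketch, with fake_pages as a hypothetical stand-in for the JSON responses:

```python
import pandas as pd

# Hypothetical stand-in for the parsed JSON of each page (no network needed).
fake_pages = {
    4100: {'user': {'profile': {'Name': 'Alice'}}},
    4101: {'user': {'profile': {'Name': 'Bob'}}},
}

records = []
for cid, jdata in fake_pages.items():
    # Each dict is one row; its keys become the column names.
    records.append({'Customer_id': cid,
                    'Name': jdata['user']['profile']['Name']})

# Build the DataFrame once, after the loop.
df = pd.DataFrame(records)
print(df)
```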