Pandas for loop with iterrows() and naming of dataframes

Time:06-12

I have a big dataframe. A sample of it looks as follows:

etf_list = pd.DataFrame({'ISIN':['LU1737652583', 'IE00B44T3H88', 'IE0005042456', 'IE00B1FZS574', 'IE00BYMS5W68'],
                     'ETF_Vendor':['Amundi', 'HSBC', 'iShares', 'iShares', 'Invesco']})

In my local folder 'ETF/Input/', among many other files, the files IE00B1FZS574.csv and IE0005042456.csv are stored.

I would like to create a dataframe from each csv file, but only for the rows where ETF_Vendor in etf_list equals 'iShares'. So I wrote the following for loop:

iShares = [] 
for i, row in etf_list.iterrows():
    if row['ETF_Vendor'] == 'iShares':
        ISIN = row['ISIN']
        iShares.append(ISIN)  # At each iteration, the list is filled with the ISINs for the relevant dataframes
        # Assign downloaded file the name of the relevant ISIN
        df[row['ISIN']] = 'ETF/Input/' + row['ISIN'] + '.csv'
        # Define file as DataFrame, again specifying the ISIN as the name for the DataFrame.
        df[row['ISIN']] = pd.read_csv(df[row['ISIN']], sep=',', skiprows=2, thousands='.', decimal=',')
    else:
        pass
  

The problem with this loop is that the dataframes are named like df['IE00B1FZS574']. But I want each dataframe to be named after its ISIN, e.g. IE00B1FZS574.

How do I have to change my code in order to name the dataframes as e.g. IE00B1FZS574 instead of df['IE00B1FZS574']?

Thank you in advance.

CodePudding user response:

There are a couple of ways to go about it.

Let's say you read the data as in your question. Here I'm storing each dataframe in a dict called dataframes, keyed by ISIN. This is orderly and Pythonic, so far so good.

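import pandas as pd

dataframes = {}
for i, row in etf_list.iterrows():  # etf_list as defined in the question
    if row['ETF_Vendor'] == 'iShares':
        name = row['ISIN']
        path = 'ETF/Input/' + name + '.csv'
        # Same read_csv arguments as in the question
        dataframes[name] = pd.read_csv(path, sep=',', skiprows=2, thousands='.', decimal=',')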

Now we can access the dataframes using dataframes['IE00B1FZS574'] and so on.
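
As a rough end-to-end sketch of the dict approach: the snippet below builds etf_list as in the question, but it reads from a temporary folder with two made-up CSV files (an assumption, standing in for 'ETF/Input/'), and it leaves out the question's skiprows/thousands/decimal arguments because the sample files are plain. It also picks the iShares rows with a boolean mask rather than iterrows(), which is a swapped-in alternative, not something the question requires.

```python
import os
import tempfile

import pandas as pd

etf_list = pd.DataFrame({
    'ISIN': ['LU1737652583', 'IE00B44T3H88', 'IE0005042456', 'IE00B1FZS574', 'IE00BYMS5W68'],
    'ETF_Vendor': ['Amundi', 'HSBC', 'iShares', 'iShares', 'Invesco'],
})

# Assumption: a temporary folder with made-up CSV content stands in for 'ETF/Input/'
input_dir = tempfile.mkdtemp()
for isin in ['IE0005042456', 'IE00B1FZS574']:
    with open(os.path.join(input_dir, isin + '.csv'), 'w') as f:
        f.write('Date,Price\n2021-01-04,10.5\n2021-01-05,10.7\n')

# Select the iShares ISINs with a boolean mask instead of looping over every row
ishares_isins = etf_list.loc[etf_list['ETF_Vendor'] == 'iShares', 'ISIN']

# One dict entry per ISIN, keyed by the ISIN string
dataframes = {
    isin: pd.read_csv(os.path.join(input_dir, isin + '.csv'))
    for isin in ishares_isins
}

print(sorted(dataframes))
print(dataframes['IE00B1FZS574'].shape)
```

The list of keys, list(dataframes), then plays the role of the iShares list from the question, so a separate list is not needed.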

How to make this a bit more fluent?

A. Keep the dataframes in the dict. This is a perfectly good option in its own right.

B. We can use a namespace:

import types

datans = types.SimpleNamespace(**dataframes)

datans.IE00B1FZS574

With the namespace we can access the items from the dict as attributes on the namespace. Of course, the keys in the dict need to be valid Python identifiers for this to work. That is the case for IE00B1FZS574, so datans.IE00B1FZS574 works here.
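
A small sketch of how one might check the identifier requirement before building the namespace. Placeholder strings stand in for the dataframes, and the key 'my-etf' is made up purely to show a name that cannot be used with dot access.

```python
import keyword
import types

# Placeholder strings stand in for the dataframes; 'my-etf' is a made-up bad key
dataframes = {'IE00B1FZS574': 'frame 1', 'my-etf': 'frame 2'}


def usable_as_attribute(name):
    # Dot access needs a valid identifier that is not a reserved keyword
    return name.isidentifier() and not keyword.iskeyword(name)


valid = {k: v for k, v in dataframes.items() if usable_as_attribute(k)}
invalid = [k for k in dataframes if not usable_as_attribute(k)]

datans = types.SimpleNamespace(**valid)
print(datans.IE00B1FZS574)
print(invalid)  # these stay reachable through the dict only
```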

C. We can add items from the dataframes dict directly into the current module-global namespace.

When is this appropriate? In a notebook maybe. Some would say this is bad style.

# update the "globals" (current module namespace) with the dict
globals().update(dataframes)

IE00B1FZS574

Now we can access the dataframes using just IE00B1FZS574 etc in the current module.
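
One reason for the caution is that globals().update() overwrites existing names without any warning. A minimal sketch, run at module level (for example in a notebook cell), that lists clashing names before injecting; placeholder strings stand in for the dataframes.

```python
# Placeholder strings stand in for the dataframes
dataframes = {'IE00B1FZS574': 'frame 1', 'IE0005042456': 'frame 2'}

# Names that already exist in the module would be silently replaced
clashes = [name for name in dataframes if name in globals()]
print(clashes)

globals().update(dataframes)
print(IE00B1FZS574)
```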

In my analyses I usually go with option A, though option B can be good too. I normally avoid C. The reason is that the analysis should be maintainable and somewhat agile. Data is data: the analysis should be data-driven and easy to update when the dataset changes slightly.
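
To illustrate what data-driven can mean with option A, here is a small sketch: the same code handles however many ISINs the dict happens to hold. The two frames and their Price column are made up for the example.

```python
import pandas as pd

# Made-up frames stand in for the CSV contents
dataframes = {
    'IE0005042456': pd.DataFrame({'Price': [10.5, 10.7]}),
    'IE00B1FZS574': pd.DataFrame({'Price': [20.1, 20.3, 20.2]}),
}

# Apply the same step to every dataframe without naming any of them in code
row_counts = {isin: len(frame) for isin, frame in dataframes.items()}

# Or stack everything into one dataframe, with the ISIN as the outer index level
combined = pd.concat(dataframes, names=['ISIN', 'row'])

print(row_counts)
print(combined.loc['IE00B1FZS574'])
```

If a new ISIN shows up in the dataset, it becomes one more key in the dict and none of this code has to change, which is harder to achieve with one variable per dataframe.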
