Preventing row iteration in Python Pandas-CodePudding

I have converted the following file into a pandas df:

https://www.fca.org.uk/publication/data/position-limits-contract-names-vpc.xlsx

I have converted the relevant rows (for myself) into a dict. The dict is of the form {principal: [spot, aggregate, set(product codes)]}. I have used the following code to convert it into this dict:

ifeu_dict = defaultdict(lambda: [0, 0, set()])


for (_, row) in df.iterrows():
        if row.loc["Venue MIC"] == "IFEU":
            ifeu_dict[row.loc["Principal Venue Product Code"]][2].add(row.loc["Venue Product Codes"])
            if type(row.loc["Spot month single limit#"]) == int:
                # no need for append as default is to create a dict
                ifeu_dict[row.loc["Principal Venue Product Code"]][0] = row.loc["Spot month single limit#"]
                ifeu_dict[row.loc["Principal Venue Product Code"]][1] = row.loc["Other month limit#"]
            if type(row.loc["Spot month single limit#"]) == str:
                try:
                    val = int(str(row.loc["Spot month single limit#"]).split()[0].replace(",", ""))
                    val_2 = int(str(row.loc["Other month limit#"]).split()[0].replace(",", ""))
                    ifeu_dict[row.loc["Principal Venue Product Code"]][0] = val
                    ifeu_dict[row.loc["Principal Venue Product Code"]][1] = val_2
                except ValueError:
                    pass

However, this is really inefficient and so i have been attempting to change the way i am creating this dictionary.

One attempt is as follows:

ifeu_dict_2 = defaultdict(lambda: [0, 0, set()])

ifeu_mask = df["Venue MIC"] == "IFEU"
ifeu_df = df.loc[ifeu_mask]
spot_mask_int = ifeu_df["Spot month single limit#"].apply(type) == int


def spot_transform(x):
    try:
        return int(str(x).split()[0].replace(",", ""))
    except ValueError:
        return


ifeu_df["Spot month single limit#"] = ifeu_df.loc[~spot_mask_int, "Spot month single limit#"].apply(spot_transform)
ifeu_df["Other month limit#"] = ifeu_df.loc[~spot_mask_int, "Other month limit#"].apply(spot_transform)
spot_mask_int = ifeu_df["Spot month single limit#"].apply(type) == int

and then trying:

temp_df = [~spot_mask_int, ["Principal Venue Product Code", "Spot month single limit#", "Other month limit#"]]
ifeu_dict_2[temp_df.loc["Principal Venue Product Code"]][0] = temp_df.loc["Spot month single limit#"]

# this gives me AttributeError: 'list' object has no attribute 'loc'

or:

ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][2].add(ifeu_df.loc["Venue Product Codes"])
ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][0] = ifeu_df.loc[spot_mask_int, "Spot month single limit#"]
ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][1] = ifeu_df.loc[spot_mask_int, "Other month limit#"]

# this gives me TypeError: 'Series' objects are mutable, thus they cannot be hashed

Been stuck for quite a while and not sure how to proceed. Any help would be massively appreciated, whether it be an answer or helpful links! (For links, I am new to coding so examples help me best atm).

If you would like a df to play with:

Index(['Commodity Derivative Name\n(including associated contracts)',
       'Venue MIC', 'Name of Trading Venue', 'Venue Product Codes',
       'Principal Venue Product Code', 'Spot month single limit#',
       'Other month limit#', 'Conversion Factor', 'Unit of measurement',
       'Definition of spot month'],
      dtype='object')

    API2 Rotterdam Coal Average Price Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RCA,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
 Gasoil Diff - Gasoil 50ppm FOB Rotterdam Barges vs Low Sulphur Gasoil 1st Line Future,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ULH,ULH,2500,2500,nan,Lots,Calendar Month
 Marine Fuel 0.5% FOB Rotterdam Barges (Platts) Future,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,MF3,MF3,2500,2500,nan,Lots,Calendar Month
API2 Rotterdam Coal (supporting Cal 1x Options),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATC,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal (supporting Qtr 1x Options),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATQ,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Cal 1x Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATD,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (122 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDE,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (214 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDF,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (305 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDG,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Futures,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATW,ATW,5,550 (24.9%),38,800 (20.5%),nan,Lots,Calendar Month
API2 Rotterdam Coal Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RCO,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Qtr 1x Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATH,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month

An entry in the finished dictionary should look like:

ATW = [5550, 38800, {'ATH', 'ATC', 'RDF', 'ATQ', 'RCA', 'ATD', 'RCO', 'RDG', 'RDE', 'ATW'}]

CodePudding user response：

Looking at the data I understand now. The data includes multiple codes for each product and you need to end up with a dict that has a single entry for each group of codes. Your way goes row by row but a more efficient way would be to use the DataFrame.groupby method and process each group in one go.

The following code should be more efficient than going row by row.

df_ifeu = df[df['Venue MIC ']=='IFEU']

ifeu_dict = {}
for principal,g in df_ifeu.groupby('Principal Venue Product Code'):
    # find where the product code is the same as the principal code
    pr = g['Venue Product Codes '] == principal
    # get the values for the principal
    spot_val = g.loc[pr, 'Spot month single limit#'].iloc[0]
    other_val = g.loc[pr, 'Other month limit#'].iloc[0]
    # get the codes
    codes = set(g['Venue Product Codes '])
    # add the product to the dict
    ifeu_dict[principal] = [spot_val, other_val, codes]

# confirm we have one dict entry per principal product code
assert(len(ifeu_dict)==df_ifeu['Principal Venue Product Code'].nunique())