Home > other >  Select values of rows across dataframes using python
Select values of rows across dataframes using python

Time:07-06

I have two dataframes, one is populated with barcodes and their product's prices across several supermarkets during the COVID pandemic, and the other one contains the actual product names and their barcodes along with a few category_id's etc, which are known (so I know which category_id to target in order to get a specific set of barcodes.)

the df_markets:

             BARCODE          AC    BFRESH  LIDL    SUPERM  
Date                            
2020-01-03  5201263086618   6.36    7.97    0   0   8.31             
2020-01-03  5201263086625   7.58    9.53    0   0   9.91    
2020-01-03  7322540574852   18.11   18.34   0   0   8.86    
2020-01-03  7322540647136   18.8    18.95   0   0   18.9    
2020-01-03  7322540587555   18.22   18.98   0   0   9.21  

and the df_bar:

    product_id  category_id   subcategory_name    BARCODE           name        
0   1.0             37.0       Adult diapers     5.201263e 12   Sani Pants N2 Medium 14 
1   2.0             37.0       Adult diapers     5.201263e 12   Sani Pants N3 Large 14  
2   3.0             37.0       Adult diapers     7.322541e 12   Tena Pants Plus Large 
3   4.0             37.0       Adult diapers     7.322541e 12   Tena Slip Super Large N4 
4   5.0             37.0       Adult diapers     7.322541e 12   Tena Pants Plus Extra Large 

What I've tried doing is first locate inside the df_bar all of the barcodes that have a category_id of 114.0 (which is of detergent) like that:

df_bar.loc[df_bar['category_id']==114.0]

and that produced a two-rows dataframe, so I know that this category_id exists, and then tried locating inside the df_markets dataframe all of the barcodes that much this search query:

df_markets.loc[df_markets['BARCODE'].isin(df_bar.loc[df_bar['category_id']==114.0])]

like that, but that produces an empty dataframe like that:


       BARCODE  AC  BFRESH  LIDL    SUPERM  
Date                            

so I am guessing the problem is that the df_markets has the barcodes in this format "5201263086618" while the df_bar has the barcodes in this format "5.201263e 12". Do you know what needs to be done to be able to search within the df_markets dataframe effectively? Or let me know if there's something wrong with my code.


A minor edit that helped me after Danila's suggestion was to turn my "query" into this :

df_bar['BARCODE'] = df_bar['BARCODE'].apply(lambda x: str(x)[:-2])
#above line changed the format
df_markets.loc[df_markets['BARCODE'].isin(df_bar.loc[df_bar['category_id']==114.0]['BARCODE'])]

so that it can take into consideration only the BARCODE column.

CodePudding user response:

just convert BARCODE column to string. example:

 df_bar['BARCODE'] = df_bar['BARCODE'].apply(lambda x: str(x)[:-2])

CodePudding user response:

You can try going for a parse!

file = pandas.read_excel("your_file.xlsx", 0) #  is the sheet index
for col, row in file.iterrows():
    if row.get('category_id') == 114.0:
        # do something

Note that this solution may not be optimized for a big dataframe, since it has to parse the entire dataset until it finds what you need

CodePudding user response:

Convert Barcode to similar format.

df_bar['BARCODE'] = int(float(df_bar['BARCODE']))

CodePudding user response:

Provided that the df_bar['BARCODE'] is a number in scientific notation you can use the following command before loading your dataset.

pd.set_option('display.float_format', lambda x: '%.3f' % x)

  • Related