Using a regular expression to pull data from multiple dataframes

I am trying to use a regular expression to read multiple contract specifications in a data frame and then use them in a filter. Below is a snippet of my code.

import pandas as pd
import re
Example_dict = {'S_1':pd.DataFrame({'Product1 Hours':'Lowest priced 5 concecutive hours in HE09 thru HE16',
                                    'Product2 Hours': 'Highest priced 4 consecutive hours in HE17 through HE22'},index=[1,2]).T.reset_index(),
                'S_2':pd.DataFrame({'Product1 Hours':'Lowest priced 5 concecutive hours in HE09 thru HE16',
                                    'Product2 Hours': 'Highest priced 4 consecutive hours in HE17 through HE23'},index=[1,2]).T.reset_index()
                }

What I am looking to grab are the hours 09 and 16 in one tuple or list, and 17 and 22 in the next list/tuple. The code I have written only pulls the first HE value from each line, so when I run the program I get 09 and 17. I have not been able to find a way to "jump" to the next occurrence of "HE". Please let me know if this makes sense. Thank you.

Edit

I wanted to update what my end goal dataframe/dictionary looks like. I am using a nested dictionary of data frames that holds different spec information. I am curious how we could loop through these keys to get the hours for each key, so for S_1 it would be (9,16), (17,22) and for S_2 it would be (09,16), (17,23). Let me know if this clarifies the end goal.

CodePudding user response：

You're applying the regex against str(Example_DF).

I assume you want to apply the regex to the entire DataFrame (every row, every cell), but str(Example_DF) is not a faithful reproduction of the entire DataFrame. Just evaluate it in Python to see. When I evaluate it I see something like this:

            index                                                  1  \\\n
0  Product1 Hours  Lowest priced 5 concecutive hours in HE09 thru...   \n
1  Product2 Hours  Highest priced 4 consecutive hours in HE17 thr...   \n\n
  2  \n
0  Lowest priced 5 concecutive hours in HE09 thru...  \n
1  Highest priced 4 consecutive hours in HE17 thr...

Note the ellipses (...) indicating that the content is truncated. The str function is summarizing the DataFrame rather than returning everything, so anything after the cut-off (here the second HE value in each cell) never reaches the regex.
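A minimal sketch of that truncation, assuming pandas' display.max_colwidth option (pinned to 50 here, which I believe is the default, so the result does not depend on local settings). The frame df and the column name spec are made up for illustration:

```python
import pandas as pd

# Hypothetical one-cell frame holding one of the spec strings from the question.
df = pd.DataFrame({"spec": ["Lowest priced 5 concecutive hours in HE09 thru HE16"]})

# Pin the column width so the demonstration does not depend on local display settings.
with pd.option_context("display.max_colwidth", 50):
    summary = str(df)

# Read the cell directly instead of going through the printed summary.
cell = df.loc[0, "spec"]

print("HE16" in summary)  # False: the tail of the cell was replaced by "..."
print("HE16" in cell)     # True: the cell itself still holds the full text
```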

Instead of this you probably want to apply the regex to every cell. This will likely be faster (regex runtime does not always scale linearly as the length of the content grows) and more organised (the results are naturally segmented by row).

For example,

# For each column, collect every HE<digits> match found in each cell
find = lambda col: [re.findall(r"HE(\d+)", cell) for cell in col]
Example_DF.apply(find)
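To see what the per-cell call returns without any pandas involved, here is a small sketch using only the standard library. The list specs is just the two strings from the question; the point is that re.findall returns every match in the string, not only the first one:

```python
import re

# The two spec strings from the question, typo included.
specs = [
    "Lowest priced 5 concecutive hours in HE09 thru HE16",
    "Highest priced 4 consecutive hours in HE17 through HE22",
]

# With a single capture group, re.findall returns the captured digits of
# every non-overlapping match, so both HE values in a string are picked up.
hours = [tuple(re.findall(r"HE(\d+)", text)) for text in specs]

print(hours)  # [('09', '16'), ('17', '22')]
```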

CodePudding user response：

Assuming your input is a Series, you can use extractall, then groupby.agg with tuple.

s = pd.Series({'Product1 Hours': 'Lowest priced 5 concecutive hours in HE09 thru HE16',
               'Product2 Hours': 'Highest priced 4 consecutive hours in HE17 through HE22'})

# extractall returns one row per match; group by the original index to collect them
out = s.str.extractall(r'HE(\d+)')[0].groupby(level=0).agg(tuple)

Output:

Product1 Hours    (09, 16)
Product2 Hours    (17, 22)
Name: 0, dtype: object

NB. If the input really is a DataFrame, please clarify the expected output. In that case you can use the same logic per column with apply, or with stack/unstack.
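A rough sketch of the per-column variant, assuming a DataFrame whose index holds the product labels and whose columns 1 and 2 hold the spec strings (this layout is my guess at the input, not something confirmed in the question):

```python
import pandas as pd

# Assumed layout: one row per product, one column per spec version.
df = pd.DataFrame(
    {
        1: ["Lowest priced 5 concecutive hours in HE09 thru HE16",
            "Highest priced 4 consecutive hours in HE17 through HE22"],
        2: ["Lowest priced 5 concecutive hours in HE09 thru HE16",
            "Highest priced 4 consecutive hours in HE17 through HE23"],
    },
    index=["Product1 Hours", "Product2 Hours"],
)

# Run the Series logic on each column: extract every match, then gather the
# matches belonging to the same original row into one tuple.
hours = df.apply(
    lambda col: col.str.extractall(r"HE(\d+)")[0].groupby(level=0).agg(tuple)
)

print(hours)
```

Note that a cell with no HE match at all produces no rows in extractall, so it would show up as NaN in the result rather than as an empty tuple.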

CodePudding user response：

I ended up doing the following to get the list of hours from a list of dataframes in a dictionary.

for key, df in dict_Option_DataBase_Q1.items():
    ContractSpeces = df['ContractSpecs'].set_index('Parameter').loc[['Product1 Hours', 'Product2 Hours'], :].values
    # One list of HE hours per selected row
    Hours = tuple(re.findall(r'HE(\d+)', str(text)) for text in ContractSpeces)

This outputs

(['09', '16'], ['17', '23'])
(['09', '16'], ['17', '23'])
(['09', '16'], ['17', '23'])
(['09', '16'], ['17', '23'])
(['09', '16'], ['17', '22'])
(['09', '16'], ['17', '22'])

This was the desired goal. Please let me know if there are glaring issues with this code, as I will most likely move forward with this method.
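For reference, a minimal sketch of the same loop that keeps the results per dictionary key and converts the hours to integers, as described in the edit to the question. The dictionary example_dict and the column name Value are hypothetical stand-ins, since the real dict_Option_DataBase_Q1 is not shown:

```python
import re
import pandas as pd

# Hypothetical stand-in for the real dictionary of data frames.
example_dict = {
    "S_1": pd.DataFrame({
        "Parameter": ["Product1 Hours", "Product2 Hours"],
        "Value": ["Lowest priced 5 concecutive hours in HE09 thru HE16",
                  "Highest priced 4 consecutive hours in HE17 through HE22"],
    }),
    "S_2": pd.DataFrame({
        "Parameter": ["Product1 Hours", "Product2 Hours"],
        "Value": ["Lowest priced 5 concecutive hours in HE09 thru HE16",
                  "Highest priced 4 consecutive hours in HE17 through HE23"],
    }),
}

# For each key, build one tuple of integer hours per spec string.
hours_by_key = {
    key: [tuple(int(h) for h in re.findall(r"HE(\d+)", text)) for text in df["Value"]]
    for key, df in example_dict.items()
}

print(hours_by_key)
# {'S_1': [(9, 16), (17, 22)], 'S_2': [(9, 16), (17, 23)]}
```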