How do I extract rows of a data frame with given conditions?-CodePudding

I'm a newbie in python. I have a data frame of over 1.6 million rows and 19 columns, containing numeric and categorical columns; concatenated from five data frames from 2016-2020 [df_16-df_20]. Hospital overall rating is the target, it contains ['Not Available', '1', '2', '3', '4', '5'] with no missing values. I want to return a new data frame that reduced entries by creating a condition: if a hospital name appears in all five data frames and have a rating, ranging from 1 to 5 i.e. range(1,6). Each hospital has only one rating; although, a hospital may have different ratings in different years. I want to retain only hospitals that are in all five data frames and whose rating is either 1 to 5, along with the corresponding rows of the remaining columns. 'SOUTHEAST ALABAMA MEDICAL CENTER' is the first facility name in the data frames. Here's my code:

    df = pd.concat([df_16, df_17, df_18, df_19, df_20], axis=0).reset_index(drop=True)

    years = [df_16, df_17, df_18, df_19, df_20]

    def sortRows(df):
      for hospital in df['Facility Name'].values:
        for x in df['Hospital overall rating'][hospital]:
          for i in range(len(years)):
            if hospital in years[i]:
              if x in range(1,6):
                return(df)

    df1 = sortRows(df)
    df1.shape

I have the following error message:

    KeyError                                  Traceback (most recent call last)
    <ipython-input-36-b220dfbf8cd1> in <module>()
    ----> 1 df1 = sortRows(df)
          2 df1.shape

    3 frames
    <ipython-input-35-90b9ef86e14c> in sortRows(df)
          2 def sortRows(df):
          3   for hospital in df['Facility Name'].values:
    ----> 4     for x in df['Hospital overall rating'][hospital]:
          5       for i in range(len(years)):
          6         if hospital in years[i]:

    /usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
        940 
        941         elif key_is_scalar:
    --> 942             return self._get_value(key)
        943 
        944         if is_hashable(key):

    /usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
       1049 
       1050         # Similar to Index.get_value, but we do not fall back to positional
    -> 1051         loc = self.index.get_loc(label)
       1052         return self.index._get_values_for_loc(self, loc, label)
       1053 

    /usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
        386                 except ValueError as err:
        387                     raise KeyError(key) from err
    --> 388             raise KeyError(key)
        389         return super().get_loc(key, method=method, tolerance=tolerance)
        390 

     KeyError: 'SOUTHEAST ALABAMA MEDICAL CENTER'

CodePudding user response：

Incomplete. Update the question to give missing information.

The exception you show in your answer is probably coming from one case where a hospital doesn't exist in every dataframe. You'll need to handle this by doing a "filter" type of operation instead of an unconditional lookup.

Something like "give all rows that match" is a filter. The .loc[] usage below is an example of that. (After filtering we can also check x.empty on the result to see if it's empty.)

I've gathered that your dataframes have at least two columns:

Facility Name
Hospital overall rating

I can help you with the first task. Let's find out if a given dataframe contains a hospital and what the rating is.

def hospital_ratings(df, hospital_name: str):
   """
   Return a pd.Series of all hospital ratings in the dataframe for a the given hospital name.
   """
   return df.loc[df['Facility Name'] == hospital_name, 'Hospital overall rating'].dropna()

With this you can go through the list of dataframes and check if the hospitals have ratings.

What I don't know: how do you represent missing data? Is .dropna() correct or is there another way to represent a missing rating? Are there multiple ratings per hospital?

And that's just the questions for this first step. Hope it helps. Make sure to compute these "partial results" and print them out, to understand what your program is doing.

for year_df in years:
    print(hospital_ratings(year_df, 'SOUTHEAST ALABAMA MEDICAL CENTER'))

CodePudding user response：

Main error alludes to this line where you are attempting to reference hospital as a named element to a single column.

df['Hospital overall rating'][hospital]

Additionally, an incorrect logic will occur at this line where you compare a string to number:

if hospital in years[i]:

However, avoid looping especially across 1.5 million rows by conditionally checking counts of hospital and year without 'Not Available' ratings. But first, assign a year column (if not already included) and then run groupby.transform (i.e., aggregation that keeps original rows) for logical filtering all inside DataFrame.loc:

hospitals_all_df = pd.concat(
    [
        df_16.assign(year=2016), 
        df_17.assign(year=2017)
        df_18.assign(year=2018)
        df_19.assign(year=2019)
        df_20.assign(year=2020)
    ], 
    ignore_index = True
).

hospitals_all_df["na_flag"] = (
     hospitals_all_df["Hospital overall rating"].eq("Not Availabile")
)

hospitals_complete_df = hospitals_all_df.loc[
    (hospitals_all_df.groupby("Facility Name")[year]
                     .transform("nunique").eq(5)) &
    (hospitals_all_df.groupby("Facility Name")["na_flag"]
                     .transform("max").eq(0))
]