Filtering grouped dataset by index column

Time:10-12

I'm trying to get a Pandas exercise done and it's driving me bonkers.

I have a dataset containing the number of cyclists that went by a certain zone of the city every hour of each day, so something like this:

Year Month Day Hour Zone 1 Zone 2 Zone 3
2014 1 1 0:00 2 0 5
2014 1 1 1:00 3 1 2
2014 1 1 2:00 4 1 1

et cetera. There are many more rows and columns. The "zone" columns contain how many cyclists were recorded for that zone at that time.

The exercise asks to group this dataframe by year, month, and day, and then take the sum on the grouped dataframe. I do that like this:

grouped = data.groupby(["Year", "Month", "Day"]).sum()

where 'data' is the original, ungrouped dataframe. The resulting dataframe has tuples in the index columns, as the exercise text says it should. Printing grouped.head() returns this: [screenshot of the grouped.head() output]

(I dropped the "Hour" column because the exercise says so.) I verified that the index column does indeed contain tuples by printing grouped.index, and it looks like this: [(2014,1,1), (2014,1,2), ...]

This is all good, but then the exercise asks to filter this dataframe so that only records from August 2017 are shown. I know I can do that by doing

grouped.filter(some-function-here)

but the problem is that I am having a hard time understanding how to filter based on the index column (which doesn't have a name and can't be referred to the way other columns can, e.g. grouped["Auroransilta"]), especially because I'm not sure I'm doing the tuple comparison correctly. For example, I tried this:

grouped.filter(lambda x: x > (2014, 1, 1) for x in grouped.index)

and I got this:

[screenshot of the result: an empty dataframe]

Variations of that approach all result in an empty dataframe. Thinking I just was doing something wrong with the tuples, I tried to filter by some other column:

grouped.filter(lambda x: x["Baana"] > 300 for x in grouped)

and that too resulted in the exact same empty dataframe. (The column "Baana" isn't in the screenshot but it is in the dataframe, and yes, there are rows with a count larger than 300). If I omit the for-loop, I get a TypeError saying that I'm not passing an iterable, so I guess it needs to be there even though I don't fully understand why (I thought filter would just apply the function I pass to every group in grouped.)

I have no idea how to fix this as I don't understand what I'm doing wrong.

CodePudding user response:

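Convert the Year, Month and Day columns to a single datetime column, group by it, and then use partial string indexing on the resulting DatetimeIndex: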

df['datetime'] = pd.to_datetime(df[["Year", "Month", "Day"]])
df = df.drop(["Year", "Month", "Day","Hour"], axis=1)
print (df)
   Zone 1  Zone 2  Zone 3   datetime
0       2       0       5 2014-01-01
1       3       1       2 2014-01-01
2       4       1       1 2014-01-01

df = df.groupby(["datetime"]).sum()
print (df)
            Zone 1  Zone 2  Zone 3
datetime                          
2014-01-01       9       2       8


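# Select August 2017 with a partial date string. Use .loc for this:
# newer pandas versions no longer accept plain df['2017-08'] for
# row selection. The three sample rows above only cover January 2014,
# so there are no August rows to print here.
df = df.loc['2017-08']
print (df)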
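
To see the partial string lookup return rows, here is a minimal self-contained sketch on made-up data. The zone names follow the question, the counts and dates are invented for illustration, and the variable names `daily` and `august_2017` are my own:

```python
import pandas as pd

# Made-up hourly counts: three rows on 2017-07-31, two on 2017-08-01,
# one on 2017-08-02 (values are illustrative only).
data = pd.DataFrame({
    "Year":  [2017, 2017, 2017, 2017, 2017, 2017],
    "Month": [7, 7, 7, 8, 8, 8],
    "Day":   [31, 31, 31, 1, 1, 2],
    "Hour":  ["0:00", "1:00", "2:00", "0:00", "1:00", "0:00"],
    "Zone 1": [2, 3, 4, 10, 20, 5],
    "Zone 2": [0, 1, 1, 7, 3, 6],
})

# Build one datetime column from the Year/Month/Day columns.
data["datetime"] = pd.to_datetime(data[["Year", "Month", "Day"]])

# Sum the zone counts per day; the result is indexed by a DatetimeIndex.
daily = (
    data.drop(columns=["Year", "Month", "Day", "Hour"])
        .groupby("datetime")
        .sum()
)

# Partial string indexing: every day whose date falls in August 2017.
august_2017 = daily.loc["2017-08"]
print(august_2017)
```

The string "2017-08" is coarser than the daily index, so pandas treats it as the whole month rather than a single timestamp.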

To filter rows by the values in a column, use boolean indexing:

df = df[df["Baana"] > 300]
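
A small sketch of what boolean indexing does step by step, assuming a daily-total frame like the one in the question. "Baana" is the column name from the question; the numbers and the names `mask` and `busy_days` are invented for illustration:

```python
import pandas as pd

# Made-up daily totals; the values are illustrative only.
grouped = pd.DataFrame(
    {"Baana": [250, 310, 480], "Zone 2": [12, 9, 30]},
    index=pd.to_datetime(["2017-08-01", "2017-08-02", "2017-08-03"]),
)

# The comparison yields one True/False value per row ...
mask = grouped["Baana"] > 300

# ... and indexing with that mask keeps only the rows marked True.
busy_days = grouped[mask]
print(busy_days)
```

Several conditions can be combined with & and |, as long as each one is wrapped in parentheses.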

CodePudding user response:

If you want to use index to filter your dataframe, you can use Index.get_level_values:

grouped.loc[(grouped.index.get_level_values('Year') == 2017)
            & (grouped.index.get_level_values('Month') == 8)]

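DataFrame.filter does not filter on values; it selects rows or columns by their labels (through its items, like or regex arguments). To filter on values, use indexing instead, such as [], .loc or .iloc.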
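
The difference can be sketched on made-up data as follows. The column names follow the question, the counts are invented, and the explanation of the empty result is my reading of the question rather than something the asker confirmed:

```python
import pandas as pd

# Made-up hourly rows; the values are illustrative only.
data = pd.DataFrame({
    "Year":  [2014, 2017, 2017, 2017],
    "Month": [1, 8, 8, 9],
    "Day":   [1, 1, 1, 1],
    "Baana": [100, 200, 150, 50],
    "Zone 2": [1, 2, 3, 4],
})
grouped = data.groupby(["Year", "Month", "Day"]).sum()

# DataFrame.filter selects by label: here, the column labelled "Baana".
only_baana = grouped.filter(items=["Baana"])

# Labels that match nothing give an empty selection, which is consistent
# with the empty dataframes described in the question.
nothing = grouped.filter(items=["no such column"])

# Filtering rows by index values needs a boolean mask instead.
years = grouped.index.get_level_values("Year")
months = grouped.index.get_level_values("Month")
august_2017 = grouped.loc[(years == 2017) & (months == 8)]
print(august_2017)
```

get_level_values works here because groupby names the index levels after the grouping columns, so "Year" and "Month" can be looked up by name.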

CodePudding user response:

You can also filter by tuple as you did, by using .loc instead, as follows:

grouped.loc[grouped.reset_index(level=2).index == (2017, 8)]
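
As a possible alternative to the expression above (a different technique, not the same one), a partial tuple can be passed to .loc or to DataFrame.xs directly. This is a sketch on invented data; `by_day` and `full_index` are names I made up:

```python
import pandas as pd

# Made-up rows; the values are illustrative only.
data = pd.DataFrame({
    "Year":  [2017, 2017, 2017, 2017],
    "Month": [7, 8, 8, 9],
    "Day":   [31, 1, 2, 1],
    "Baana": [90, 350, 120, 60],
})
grouped = data.groupby(["Year", "Month", "Day"]).sum()

# Partial tuple: matches on the first two index levels (Year, Month).
# The matched levels are dropped, leaving Day as the index.
by_day = grouped.loc[(2017, 8), :]

# Same rows, but keeping the full (Year, Month, Day) tuples in the index.
full_index = grouped.xs((2017, 8), level=["Year", "Month"], drop_level=False)
print(full_index)
```

Which form fits better depends on whether the later steps of the exercise still need the year and month in the index.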