I am writing a function that selects a subset of rows from a pandas DataFrame.
The function looks like this:
import pandas as pd

def get_predictions(df: pd.DataFrame, subset: str) -> pd.DataFrame:
    return df[['properties', 'prediction']].loc[subset]
I would like this function to be able to handle the case where I want to select all of the rows in the DataFrame. One solution to this is to make the subset argument default to None and return the entire DataFrame if the subset argument is set to None.
from typing import Optional

def get_predictions(df: pd.DataFrame, subset: Optional[str] = None) -> pd.DataFrame:
    if subset is None:
        return df[['properties', 'prediction']]
    else:
        return df[['properties', 'prediction']].loc[subset]
I don't like this solution because it duplicates code. Is there a better solution that avoids the duplication? Specifically, is there an object that I could pass into .loc[] which would return all of the rows in the DataFrame?
This is the ideal solution that I am looking for:
def get_predictions(df: pd.DataFrame, subset=MysteryObject) -> pd.DataFrame:
    return df[['properties', 'prediction']].loc[subset]
Is there a MysteryObject that could achieve this desired behavior?
CodePudding user response:
Just pass in
subset = df.index
Also, it is better practice to subset both the rows and the columns in a single .loc call. That way the selection is one indexing operation, rather than first building an intermediate DataFrame of the columns and then indexing its rows. So just do
df.loc[subset, ['properties', 'prediction']]
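A minimal sketch of this suggestion, assuming a small made-up frame (`demo_df` and its values are illustrative and not taken from the question). Passing the frame's own index as `subset` selects every row through the same `.loc` call that handles a real subset:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
demo_df = pd.DataFrame(
    {"properties": [1, 2, 3], "prediction": [4, 5, 6], "other": [7, 8, 9]},
    index=["a", "b", "c"],
)


def get_predictions(df: pd.DataFrame, subset) -> pd.DataFrame:
    # Select rows and columns in a single .loc call.
    return df.loc[subset, ["properties", "prediction"]]


# Passing the full index selects every row.
all_rows = get_predictions(demo_df, demo_df.index)

# Passing a list of labels selects just those rows.
some_rows = get_predictions(demo_df, ["a", "c"])
```

Note that `df.index` depends on the frame being passed, so it cannot be written as a default argument value; the caller has to supply it explicitly.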
CodePudding user response:
Let's try setting the default to slice(None) instead of just None:
def get_predictions(
        df: pd.DataFrame, subset: str = slice(None)
) -> pd.DataFrame:
    return df[['properties', 'prediction']].loc[subset]
Although it would be even better practice to subset both axes in one step:
def get_predictions(
        df: pd.DataFrame, subset: str = slice(None)
) -> pd.DataFrame:
    return df.loc[subset, ['properties', 'prediction']]
slice(None) is equivalent to : inside an indexer, with the exception that it can be assigned to a variable.
df.loc[:, 'col'] == df.loc[slice(None), 'col']
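The comparison above is element-wise, so it produces a boolean Series rather than a single value. A small sketch of checking the equivalence as a whole, assuming a made-up one-column frame (`demo_df` and the name `everything` are illustrative):

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
demo_df = pd.DataFrame({"col": [10, 20, 30]}, index=["a", "b", "c"])

# slice(None) can be stored in a variable, unlike a bare ":".
everything = slice(None)

with_colon = demo_df.loc[:, "col"]
with_slice = demo_df.loc[everything, "col"]

# .equals compares the two results as a whole and returns a single bool.
same = with_colon.equals(with_slice)
print(same)  # True
```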
Test Code:
test_df = pd.DataFrame({'properties': [1, 2, 3],
                        'prediction': [4, 5, 6],
                        'other': [7, 8, 9]},
                       index=['a', 'b', 'c'])
print('Subset \'a\'')
print(get_predictions(test_df, 'a'))
print('No Subset')
print(get_predictions(test_df))
Output:
Subset 'a'
properties    1
prediction    4
Name: a, dtype: int64
No Subset
   properties  prediction
a           1           4
b           2           5
c           3           6
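If a None default is still preferred in the signature, as in the question, one possible variant maps None to slice(None) before indexing, so the .loc call is written only once. This is a sketch under that assumption; the name `rows` and the sample frame `demo_df` are illustrative. It also shows that, as in the output above, a single label returns a Series, while a one-element list returns a DataFrame:

```python
import pandas as pd


def get_predictions(df: pd.DataFrame, subset=None):
    # Translate "no subset" into a slice that selects every row.
    rows = slice(None) if subset is None else subset
    return df.loc[rows, ["properties", "prediction"]]


# Hypothetical sample data, for illustration only.
demo_df = pd.DataFrame(
    {"properties": [1, 2, 3], "prediction": [4, 5, 6], "other": [7, 8, 9]},
    index=["a", "b", "c"],
)

full = get_predictions(demo_df)             # all rows, DataFrame
one_as_series = get_predictions(demo_df, "a")    # scalar label, Series
one_as_frame = get_predictions(demo_df, ["a"])   # list of labels, DataFrame
```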