Home > OS >  Filter data frame on multiple conditions
Filter data frame on multiple conditions

Time:07-09

I’d like to filter on multiple conditions within a data frame. For instance, find dates where Open > 10 and Close < 50 within the next n days of a Open > 10 date.

import yfinance as yf

data = yf.download(‘spy’, start=‘1990-01-01’ end=‘2000-01-01’)

data[(data[‘Open’] > 10) & (data[‘Close’] < 50 in data[‘Date’] to data[‘Date’]   7)]

How can I filter a dataframe on correlated dates with a condition?

CodePudding user response:

If I understand correctly, you need prices greater than 10, but less than 50, after the seventh day of the start of the data? Use straight quotes and there should be a comma after start in the query itself.

import yfinance as yf

data = yf.download('spy', start='1990-01-01', end='2000-01-01')
df = data[(data['Open'] > 10) & (data['Close'] < 50) & (data.index > data.index[6])]

print(df)

CodePudding user response:

Here's How I would do this.

 `data = yf.download('spy', start='1990-01-01', end='2000-01-01')`

Output:

print(data)
                 Open       High         Low       Close  Adj Close   Volume
Date                                                                        
1993-01-29   43.96875   43.96875   43.750000   43.937500  25.438084  1003200
1993-02-01   43.96875   44.25000   43.968750   44.250000  25.619017   480500
1993-02-02   44.21875   44.37500   44.125000   44.343750  25.673300   201300
1993-02-03   44.40625   44.84375   44.375000   44.812500  25.944689   529400
1993-02-04   44.96875   45.09375   44.468750   45.000000  26.053246   531500
              ...        ...         ...         ...        ...      ...
1999-12-27  146.50000  146.78125  145.062500  146.281250  96.697540  2691000
1999-12-28  145.87500  146.50000  145.484375  146.187500  96.635559  4084500
1999-12-29  146.31250  146.81250  145.312500  146.812500  97.048737  3001000
1999-12-30  147.12500  147.56250  146.187500  146.640625  96.935066  3641300
1999-12-31  146.84375  147.50000  146.250000  146.875000  97.090034  3172700

[1749 rows x 6 columns]
  1. Filter out the data to only get the rows where 'Open' > 10 and 'Close' < 50

    df = data[(data['Open'] > 10) & (data['Close'] < 50)]

Output:

print(df)
                 Open      High        Low      Close  Adj Close   Volume
Date                                                                     
1993-01-29  43.968750  43.96875  43.750000  43.937500  25.438084  1003200
1993-02-01  43.968750  44.25000  43.968750  44.250000  25.619017   480500
1993-02-02  44.218750  44.37500  44.125000  44.343750  25.673300   201300
1993-02-03  44.406250  44.84375  44.375000  44.812500  25.944689   529400
1993-02-04  44.968750  45.09375  44.468750  45.000000  26.053246   531500
              ...       ...        ...        ...        ...      ...
1995-03-17  49.437500  49.62500  49.406250  49.562500  30.364269    89900
1995-03-20  49.625000  49.62500  49.468750  49.562500  30.364269    91700
1995-03-21  49.562500  49.87500  49.359375  49.437500  30.287670   104400
1995-03-22  49.531250  49.53125  49.328125  49.484375  30.316399    74900
1995-03-23  49.421875  49.65625  49.359375  49.515625  30.335543   220500

[543 rows x 6 columns]
  1. Create a column that shifts the dates, essentially creating a column that contains the following date in that fitlered dataframe

    df = df.reset_index(drop=False)

    df['Next_Date'] = df['Date'].shift(-1)

Output:

print(df)
          Date       Open      High  ...  Adj Close   Volume  Next_Date
0   1993-01-29  43.968750  43.96875  ...  25.438084  1003200 1993-02-01
1   1993-02-01  43.968750  44.25000  ...  25.619017   480500 1993-02-02
2   1993-02-02  44.218750  44.37500  ...  25.673300   201300 1993-02-03
3   1993-02-03  44.406250  44.84375  ...  25.944689   529400 1993-02-04
4   1993-02-04  44.968750  45.09375  ...  26.053246   531500 1993-02-05
..         ...        ...       ...  ...        ...      ...        ...
538 1995-03-17  49.437500  49.62500  ...  30.364269    89900 1995-03-20
539 1995-03-20  49.625000  49.62500  ...  30.364269    91700 1995-03-21
540 1995-03-21  49.562500  49.87500  ...  30.287670   104400 1995-03-22
541 1995-03-22  49.531250  49.53125  ...  30.316399    74900 1995-03-23
542 1995-03-23  49.421875  49.65625  ...  30.335543   220500        NaT

[543 rows x 8 columns]
  1. Get the difference in days for each row and the following date column we just created

    df['Difference'] = (df['Next_Date'] - df['Date']).dt.days

Output:

print(df)
          Date       Open      High  ...   Volume  Next_Date  Difference
0   1993-01-29  43.968750  43.96875  ...  1003200 1993-02-01         3.0
1   1993-02-01  43.968750  44.25000  ...   480500 1993-02-02         1.0
2   1993-02-02  44.218750  44.37500  ...   201300 1993-02-03         1.0
3   1993-02-03  44.406250  44.84375  ...   529400 1993-02-04         1.0
4   1993-02-04  44.968750  45.09375  ...   531500 1993-02-05         1.0
..         ...        ...       ...  ...      ...        ...         ...
538 1995-03-17  49.437500  49.62500  ...    89900 1995-03-20         3.0
539 1995-03-20  49.625000  49.62500  ...    91700 1995-03-21         1.0
540 1995-03-21  49.562500  49.87500  ...   104400 1995-03-22         1.0
541 1995-03-22  49.531250  49.53125  ...    74900 1995-03-23         1.0
542 1995-03-23  49.421875  49.65625  ...   220500        NaT         NaN

[543 rows x 9 columns]
  1. Filter on that difference of days by your "n_days"

    n_days = 2

    df = df[df['Difference'] <= n_days]

Output:

print(df)
          Date      Open       High  ...  Volume  Next_Date  Difference
1   1993-02-01  43.96875  44.250000  ...  480500 1993-02-02         1.0
2   1993-02-02  44.21875  44.375000  ...  201300 1993-02-03         1.0
3   1993-02-03  44.40625  44.843750  ...  529400 1993-02-04         1.0
4   1993-02-04  44.96875  45.093750  ...  531500 1993-02-05         1.0
6   1993-02-08  44.96875  45.125000  ...  596100 1993-02-09         1.0
..         ...       ...        ...  ...     ...        ...         ...
536 1995-03-15  49.50000  49.578125  ...  278500 1995-03-16         1.0
537 1995-03-16  49.43750  49.812500  ...   20400 1995-03-17         1.0
539 1995-03-20  49.62500  49.625000  ...   91700 1995-03-21         1.0
540 1995-03-21  49.56250  49.875000  ...  104400 1995-03-22         1.0
541 1995-03-22  49.53125  49.531250  ...   74900 1995-03-23         1.0

[430 rows x 9 columns]

Full Code:

import yfinance as yf

data = yf.download('spy', start='1990-01-01', end='2000-01-01')
n_days = 2

df = data[(data['Open'] > 10) & (data['Close'] < 50)]
df = df.reset_index(drop=False)
df['Next_Date'] = df['Date'].shift(-1)
df['Difference'] = (df['Next_Date'] - df['Date']).dt.days

df = df[df['Difference'] <= n_days]

CodePudding user response:

supposing you use pandas, try this, i used this code for only one condition, but most likely you can expand to use multiple conditions, put your condition logic in the condition variable

    # let df be your dataframe
    # true or false series over condition
    condition = df['state'] == ''
    # list of true false values
    condition = condition.values
    # df rows given condition
    df = df.loc[condition]
  • Related