I have a program that scrapes an API and pulls the required values out of its fields. Each JSON object has a field called published_date. I want to publish only the values from the last 2 months, counting back from the current date.
try:
    price = str(price).replace(',', '')
    price = Decimal(price)
    if date < end:
        if not math.isnan(price):
            report_item = PriceItem(
                source=SOURCE,
                source_url=crawled_url,
                original_index_id=original_index_id,
                index_specification=index_specification,
                published_date=date,
                price=price.quantize(Decimal('1.00'))
            )
            yield report_item
except DecimalException as ex:
    self.logger.error(f"Non decimal price of {price} "
                      f"found in {original_index_id}", exc_info=ex)
The published date is extracted:
for report_date in REPORT_DATE_TYPES:
    if report_date in result:
        date = result[report_date].split(' ')[0]
        date = datetime.strptime(date, '%m/%d/%Y')
MAX_REPORT_MONTHS = 3
current_date = datetime.now()
current_date_str = current_date.strftime('%m/%d/%Y')
start = datetime.strptime(current_date_str, '%m/%d/%Y')
last_date = current_date - relativedelta(months=MAX_REPORT_MONTHS)
last_date_str = last_date.strftime('%m/%d/%Y')
end = datetime.strptime(last_date_str, '%m/%d/%Y')
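In the code above, current_date_str and last_date_str are the current date and the cutoff date as strings; start and end are the same two dates parsed back into datetime objects.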
Extract of the api:
CodePudding user response:
Once you have gathered the data into a DataFrame, you can convert the column containing the dates to datetime and then use comparison operators to keep only the desired rows.
For example, assuming this is your data:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

data = {'date': ['02/02/2022 10:23:23', '09/23/2021 10:23:23', '02/01/2021 10:23:23', '12/15/2021 10:23:23'], 'random': [324, 231, 213, 123]}
df = pd.DataFrame(data)
# convert date column to datetime
df['date'] = pd.to_datetime(df['date'], format="%m/%d/%Y %H:%M:%S")
# select "threshold" date, two months before current one
current_date = datetime.now()
last_date = current_date - relativedelta(months=2)
# select data published after last_date
df[df['date'] > last_date]
Taking today's date as the reference, we get the result below (the rows that are kept depend on the day you run the code).
Before:
                  date  random
0  02/02/2022 10:23:23     324
1  09/23/2021 10:23:23     231
2  02/01/2021 10:23:23     213
3  12/15/2021 10:23:23     123
After:
                 date  random
0 2022-02-02 10:23:23     324
3 2021-12-15 10:23:23     123
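If the items are yielded one at a time, as in the question, rather than collected into a DataFrame, the same comparison can be applied per item. Below is a minimal sketch using only the standard library. The helpers months_ago and is_recent and the sample values are hypothetical names made up for illustration, not part of the original scraper, and the reference date is fixed so the result does not depend on when it is run.

```python
import calendar
from datetime import datetime


def months_ago(reference: datetime, months: int) -> datetime:
    """Return `reference` shifted back by `months` calendar months.

    The day is clamped to the last day of the target month, which mirrors
    how relativedelta(months=...) treats e.g. 31 March minus one month.
    """
    # Count months linearly so that crossing a year boundary is plain arithmetic.
    total = reference.year * 12 + (reference.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return reference.replace(year=year, month=month,
                             day=min(reference.day, last_day))


def is_recent(published: datetime, reference: datetime, months: int = 2) -> bool:
    """True if `published` falls within the last `months` months before `reference`."""
    return months_ago(reference, months) <= published <= reference


# Hypothetical sample values in the same format as the question's date field.
raw_dates = ["02/02/2022 10:23:23", "09/23/2021 10:23:23", "12/15/2021 10:23:23"]
reference = datetime(2022, 2, 10)  # assumption: fixed "today" for a repeatable result

recent = [
    raw for raw in raw_dates
    if is_recent(datetime.strptime(raw.split(" ")[0], "%m/%d/%Y"), reference)
]
print(recent)  # ['02/02/2022 10:23:23', '12/15/2021 10:23:23']
```

In the question's loop, a check like is_recent(date, datetime.now()) would take the place of the date < end condition. As written, date < end keeps the items that are older than the cutoff, which is the opposite of what is wanted.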