I'm writing a program whose first step is to split a filename into two components. The files are named in the format: 12080103_20220809191000.nc
where the number before the underscore is the file name and the string after it is the date (2022/08/09 19:10:00 in this case).
I'm splitting the filename as follows:
filename = os.path.basename(pathname)
sn_file = filename.split("_")
file_date = dt.datetime.strptime(sn_file[1], "%Y%m%d%H%M%S.nc")
sn=sn_file[0]
However, this gives the error: ValueError: time data '12070069' does not match format '%Y%m%d%H%M%S.nc'
which suggests that the first string of digits is somehow getting mixed up with the second.
I have no idea how or why this is happening. Any advice would be a great help.
EDIT: As requested, here is the full code:
import xarray as xr
import datetime as dt
import os
from pathlib import Path
import pandas as pd
import plotly.express as px
import numpy as np
def sn_date_fromfile(pathname):
    filename = os.path.basename(pathname)
    sn_file = filename.split("_")
    print(sn_file)
    file_date = dt.datetime.strptime(sn_file[1], "%Y%m%d%H%M%S.nc")
    sn = sn_file[0]
    # Order matches the DataFrame columns used below: sn, file_time, path_name
    return sn, file_date, pathname
def plot_single_cam(dat, stat, title=""):
    #dat[stat]=dat[stat].fillna(0.).where(dat[stat]>2000)
    #dat[stat]=dat[stat].fillna(0.).where(dat[stat]<500)
    arr = np.array(dat[stat])
    arr_max = np.quantile(arr, 0.95)
    arr_min = np.quantile(arr, 0.05)
    # awful - figure out how to do quantile
    arr_max = arr_max - (arr_max * 0.98)
    # the operator was missing here; "-" is assumed by analogy with the line above
    arr_min = arr_min - (arr_min * 1.02)
    arr[arr < 1000] = 1000
    arr[arr > 2000] = 2000
    dat[stat].values = arr
    fig = px.imshow((dat[stat] - 1000) / 10,
                    color_continuous_scale='temps',
                    origin='lower',
                    animation_frame="time",
                    aspect="equal",
                    contrast_rescaling="minmax",
                    width=750,
                    height=750,
                    title=title,
                    )
    return fig
hours_ago = 0.5
# inside /var/www/data/PI-160/ are the .nc files. Change the directory to match where the files are
paths = sorted(Path('/Volumes/1A/file1').iterdir(), key=os.path.getmtime)
file_meta = pd.DataFrame(
    [sn_date_fromfile(path) for path in paths],
    columns=["sn", "file_time", "path_name"],
)
file_meta = file_meta[
    file_meta.file_time >
    (dt.datetime.utcnow() - dt.timedelta(hours=hours_ago))
]
stat = "t_b_snapshot"
try:
    os.remove("/var/www/html/static_plots/static.html")
except FileNotFoundError:
    # Nothing to delete on the first run
    pass
with open("/var/www/html/static_plots/static.html", 'a') as f:
    for sn in file_meta["sn"].unique():
        file_meta_sn = file_meta[file_meta['sn'] == sn]
        dat = xr.open_mfdataset(file_meta_sn['path_name'], engine="netcdf4")
        fig = plot_single_cam(dat, stat, title=f"{sn} updated {dt.datetime.utcnow()} UTC")
        f.write(fig.to_html(full_html=False, include_plotlyjs='cdn'))
CodePudding user response:
Using Python 3.8.10 I cannot reproduce this: datetime correctly parses the time string for the example you gave. I feel the error is in the reliance on the underscore split and the use of sn_file[1]. Is it possible for a filename to contain another underscore before the final expected one? Try using sn_file[-1] to get the last member of the split, which is what we want.
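To illustrate how an extra underscore could produce exactly the error in the question, here is a minimal sketch. The second filename is hypothetical (it is an assumption, not a name taken from your directory); it just shows that when something precedes the serial number, index 1 becomes the serial number while index -1 is still the date part.

```python
good = "12080103_20220809191000.nc"
# Hypothetical name with an extra underscore before the serial number (assumed for illustration)
odd = "._12070069_20220809191000.nc"

good_parts = good.split("_")
odd_parts = odd.split("_")

print(good_parts)  # ['12080103', '20220809191000.nc']
print(odd_parts)   # ['.', '12070069', '20220809191000.nc']

# Index 1 is no longer the date part, but the last element still is
print(odd_parts[1], odd_parts[-1])
```

If your directory contains names like that, printing the split result for each file should make them easy to spot.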
I'm assuming you're doing this for many files, some of which do not have this problem. Try using a try statement to catch the ValueError and print filename and sn_file, e.g.
filename = os.path.basename(pathname)
sn_file = filename.split("_")
try:
    file_date = dt.datetime.strptime(sn_file[1], "%Y%m%d%H%M%S.nc")
except ValueError:
    # Show the offending name and how it was split
    print(f"error with filename '{filename}' {sn_file}")
sn = sn_file[0]
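Once you know which names are the odd ones, one option is to have the parser return None for anything that does not fit, so the caller can skip those files. The following is only a sketch under that assumption: parse_name is a hypothetical helper, not part of your code, and it splits once from the right so that earlier underscores do not shift the date part.

```python
import datetime as dt
import os


def parse_name(pathname):
    """Return (sn, file_date), or None if the name does not fit the pattern."""
    filename = os.path.basename(pathname)
    # Split only at the last underscore, so the date part is always the tail
    head, sep, tail = filename.rpartition("_")
    if not sep:
        # No underscore at all
        return None
    try:
        file_date = dt.datetime.strptime(tail, "%Y%m%d%H%M%S.nc")
    except ValueError:
        # The tail is not a timestamp followed by ".nc"
        return None
    return head, file_date


print(parse_name("/some/dir/12080103_20220809191000.nc"))
print(parse_name("/some/dir/notes.txt"))
```

Note that head is everything before the last underscore, so for a name with extra leading characters it will contain those too; this only protects the date parsing.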
CodePudding user response:
The issue is with either the file naming or the splitting, not with the datetime parsing. Basically you're passing the wrong chunk of the filename to dt.datetime.strptime().
One potential fix for this would be to not rely on splitting the filename at the _, but instead to use a regex to look for the part of the filename with the appropriate format. This would also eliminate the need for parsing the string for the datetime.
For example:
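import datetime as dt
import re
filename = '12080103_20220809191000.nc'
# Raw string so the \d escapes reach the regex engine unchanged
match = re.search(r'(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})', filename)
# Year, month, day, hour, minute, second as integers
fields = [int(g) for g in match.groups()]
file_date = dt.datetime(*fields)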
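The same idea can be extended to pull out the serial number as well and to skip files that do not match. This is a rough sketch, assuming the names always end in digits, an underscore, 14 digits and ".nc"; NAME_RE and sn_and_date are hypothetical names introduced here for illustration.

```python
import datetime as dt
import re

# Assumed pattern: <digits>_<14-digit timestamp>.nc at the end of the name
NAME_RE = re.compile(r"(?P<sn>\d+)_(?P<stamp>\d{14})\.nc$")


def sn_and_date(filename):
    """Return (sn, file_date), or None if the name does not fit the pattern."""
    m = NAME_RE.search(filename)
    if m is None:
        return None
    try:
        file_date = dt.datetime.strptime(m.group("stamp"), "%Y%m%d%H%M%S")
    except ValueError:
        # 14 digits that do not form a valid date, e.g. month 13
        return None
    return m.group("sn"), file_date


for name in ["12080103_20220809191000.nc", "._12070069_20220809191000.nc", ".DS_Store"]:
    print(name, sn_and_date(name))
```

Because the pattern is anchored at the end of the name, anything in front of the serial number is ignored rather than shifting the pieces around.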