I am writing code to process a list of URL's, however some of the URL's have issues and I need to pass them in my for loop. I've tried this:
x_data = []
y_data = []
for item in drop['URL']:
if re.search("J", str(item)) == True:
pass
else:
print(item)
var = urllib.request.urlopen(item)
hdul = ft.open(var)
data = hdul[0].data
start = hdul[0].header['WMIN']
finish = hdul[0].header['WMAX']
start_log = np.log10(start)
finish_log = np.log10(finish)
redshift = hdul[0].header['Z']
length = len(data[0])
xaxis = np.linspace(start, finish, length)
#calculating emitted wavelength from observed and redshift
x_axis_nr = [xaxis[j]/(redshift 1) for j in range(len(xaxis))]
gauss_kernel = Gaussian1DKernel(5/3)
flux = np.convolve(data[0], gauss_kernel)
wavelength = np.convolve(x_axis_nr, gauss_kernel)
x_data.append(x_axis_nr)
y_data.append(data[0])
where drop is a previously defined pandas DataFrame. Previous questions on this topic suggested regex might be the way to go, and I have tried this to filter out any URL containing the letter J (which are only the bad ones).
I get this:
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0581.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0582.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0584.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0587.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0589.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0592.fit
http://www.gama-survey.org/dr3/data/spectra/2qz/J113606.3 001155a.fit
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-2a3083a3a6d7> in <module>
14 finish_log = np.log10(finish)
15 redshift = hdul[0].header['Z']
---> 16 length = len(data[0])
17
18 xaxis = np.linspace(start, finish, length)
TypeError: object of type 'numpy.float32' has no len()
which is the same kind of error I was having before trying to remove J urls, so clearly my regex is not working. I would appreciate some advice on how to filter these, and am happy to provide more information as required.
CodePudding user response:
There's no need to compare the result of re.search
with True
. From documentation
you can see that search
returns a match object
when a match is found:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return
None
if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
So, when comparing a match object
with True
the return is False
and your else
condition is executed.
In [35]: re.search('J', 'http://www.gama-survey.org/dr3/data/spectra/2qz/J113606.3 001155a.fit') == True
Out[35]: False