Slicing of Python Array

Time:01-01

I have an array that I created by calling np.loadtxt on a CSV file.

dataresale = np.loadtxt(
    resale, skiprows=1, usecols=(0,2,10),
    dtype=[('month', 'U50'),
           ('flat_type', 'U50'),
           ('resale_price', 'f8')], delimiter=',')

print(dataresale['month'])

Below is the output:

['2017-01' '2017-01' '2017-01' ... '2021-03' '2021-10' '2021-12']

I would like to extract only the data from 2021 (all months).

Below is a script I used to extract rows by year from another array, but in this dataset the year is combined with the month:

x = datap[datax['year'] == 2019]

Is there a way I can modify the script above to take out all 2021 data?

CodePudding user response:

In general, numpy.ndarray objects have limited support for string operations. Notably, string slicing seems to be absent. If you look at similar questions, you can at least hack a slice from the front using a view (with a small N for the UN type). However, since your array has a structured dtype, creating such a view is awkward.

In this particular case, though, you can use the np.char.startswith function.

Some example data (please always provide this in the future. You are coming here asking for help, so don't make people do extra work to make your own question easy to answer. It's actually part of the rules, but it is also just common courtesy):

(py39) Juans-MBP:workspace juan$ cat resale.csv
2017-01,foo,4560.0
2019-01,bar,3432.34
2017-01,baz,34199.5
2019-01,baz,3232.34
2017-01,bar,932.34

Ok, so using that above:

In [1]: import numpy as np

In [2]: resale = "resale.csv"

In [3]: data = np.loadtxt(resale,dtype=[('month','U50'),('flat_type','U50'),
   ...:                                       ('resale_price','f8')],delimiter=',')

In [4]: data
Out[4]:
array([('2017-01', 'foo',  4560.  ), ('2019-01', 'bar',  3432.34),
       ('2017-01', 'baz', 34199.5 ), ('2019-01', 'baz',  3232.34),
       ('2017-01', 'bar',   932.34)],
      dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])

In [5]: np.char.startswith(data['month'], "2019")
Out[5]: array([False,  True, False,  True, False])

In [6]: data[np.char.startswith(data['month'], "2019")]
Out[6]:
array([('2019-01', 'bar', 3432.34), ('2019-01', 'baz', 3232.34)],
      dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])
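For the 2021 case from the question, the same idea can be sketched as a self-contained script. The CSV contents below are made-up sample rows (not the asker's data), loaded from an in-memory string instead of a file:

```python
import io
import numpy as np

# Hypothetical sample rows standing in for the real CSV file.
csv_text = """month,flat_type,resale_price
2017-01,3 ROOM,250000.0
2021-03,4 ROOM,480000.0
2021-12,5 ROOM,610000.0
"""

dataresale = np.loadtxt(
    io.StringIO(csv_text), skiprows=1, delimiter=',',
    dtype=[('month', 'U50'),
           ('flat_type', 'U50'),
           ('resale_price', 'f8')])

# Boolean mask: True where the month string begins with "2021".
mask = np.char.startswith(dataresale['month'], '2021')
rows_2021 = dataresale[mask]
print(rows_2021['month'])
```

The mask has one entry per row, so it can index the whole structured array and keeps every field of the matching rows.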

Alternatively, since you are working with dates, which NumPy supports natively, you can use the dtype 'datetime64[D]'. Each value is parsed as a datetime64, with the day filled in for you:

In [14]: data = np.loadtxt(resale,dtype=[('month','datetime64[D]'),('flat_type','U50'),
    ...:                                       ('resale_price','f8')],delimiter=',')

In [8]: data
Out[8]:
array([('2017-01-01', 'foo',  4560.  ), ('2019-01-01', 'bar',  3432.34),
       ('2017-01-01', 'baz', 34199.5 ), ('2019-01-01', 'baz',  3232.34),
       ('2017-01-01', 'bar',   932.34)],
      dtype=[('month', '<M8[D]'), ('flat_type', '<U50'), ('resale_price', '<f8')])

Then you can use something like:

In [9]: data['month'] >= np.datetime64("2019")
Out[9]: array([False,  True, False,  True, False])
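Note that a single `>=` comparison also matches every later year; it only isolates 2019 here because the sample has no later dates. To pick out exactly one year, a bounded range or a cast to year resolution may be safer. A minimal sketch, using a plain made-up datetime64 array rather than the structured array above:

```python
import numpy as np

# Hypothetical month values, stored at month resolution.
months = np.array(['2017-01', '2019-01', '2021-03', '2021-12', '2022-02'],
                  dtype='datetime64[M]')

# Option 1: half-open range [2021, 2022).
in_2021 = (months >= np.datetime64('2021')) & (months < np.datetime64('2022'))

# Option 2: truncate each value to its year, then compare for equality.
same_year = months.astype('datetime64[Y]') == np.datetime64('2021')

print(in_2021)
print(same_year)
```

Both masks should agree; the range form extends naturally to arbitrary start and end dates.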

CodePudding user response:

Construct a sample array:

In [359]: arr = np.zeros(6, dtype=[('month', 'U50'),
     ...:            ('flat_type', 'U50'),
     ...:            ('resale_price', 'f8')])
In [360]: arr['month']=['2017-01', '2017-01', '2017-01','2021-03', '2021-10', '2021-12']

startswith

Since the interest is in the first four characters, we can do:

In [362]: np.char.startswith(arr['month'],'2021')
Out[362]: array([False, False, False,  True,  True,  True])

which effectively is:

In [364]: [s.startswith('2021') for s in arr['month']]
Out[364]: [False, False, False, True, True, True]

The list comprehension is faster, though for a better comparison let's get the indices:

In [366]: timeit np.nonzero([s.startswith('2021') for s in arr['month']])
15.1 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [367]: timeit  np.nonzero(np.char.startswith(arr['month'],'2021'))
16.7 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

astype truncation

But astype is a relatively quick way of truncating string dtypes, effectively the [:4] type of string slice:

In [371]: arr['month'].astype('U4')
Out[371]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [372]: arr['month'].astype('U4')=='2021'
Out[372]: array([False, False, False,  True,  True,  True])

In [374]: timeit np.nonzero(arr['month'].astype('U4')=='2021')
6.47 µs ± 7.53 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
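If this filter is needed in several places, the truncation can be wrapped in a small helper. This is only a sketch; `rows_for_year` is a hypothetical name, and it assumes the field holds strings that begin with a four-digit year:

```python
import numpy as np

def rows_for_year(records, year, field='month'):
    """Return the rows whose `field` string starts with the given 4-digit year."""
    # Casting to 'U4' truncates each string to its first four characters.
    prefixes = records[field].astype('U4')
    return records[prefixes == str(year)]

arr = np.zeros(6, dtype=[('month', 'U50'),
                         ('flat_type', 'U50'),
                         ('resale_price', 'f8')])
arr['month'] = ['2017-01', '2017-01', '2017-01',
                '2021-03', '2021-10', '2021-12']

selected = rows_for_year(arr, 2021)
print(selected['month'])
```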

datetime64[Y]

Another option is to convert the strings to datetime64:

In [376]: arr['month'].astype('datetime64[Y]')
Out[376]: 
array(['2017', '2017', '2017', '2021', '2021', '2021'],
      dtype='datetime64[Y]')

With the conversion time:

In [379]: timeit np.nonzero(arr['month'].astype('datetime64[Y]')==np.array('2021','datetime64[Y]'))
17.5 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And if we can justify doing the conversion ahead of time:

In [380]: %%timeit yrs = arr['month'].astype('datetime64[Y]')
     ...: np.nonzero(yrs==np.array('2021','datetime64[Y]'))
6.2 µs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

CodePudding user response:

If you insist on using numpy, you can extract the data you want to integer format using slicing:

strs = dataresale['month'].copy()[:, None].view('U1')
year = strs[:, :4].view('U4').astype(int).ravel()
month = strs[:, 5:7].view('U2').astype(int).ravel()

The conversion to 2D followed by the ravel at the end allows the 'U1' view to expand into columns. The copy is necessary because the data is not completely contiguous (though it is in the column dimension).

Any mask you construct from these arrays will be applicable to the original, e.g.:

dataresale[year == 2021]
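Having separate integer year and month arrays also makes combined conditions easy. The sketch below uses made-up sample values and adds `np.ascontiguousarray` before each re-view, an extra step (not in the snippet above) that should keep it working on NumPy versions that refuse to change the itemsize of a non-contiguous slice:

```python
import numpy as np

# Hypothetical sample array with the same dtype as in the question.
dataresale = np.zeros(4, dtype=[('month', 'U50'),
                                ('flat_type', 'U50'),
                                ('resale_price', 'f8')])
dataresale['month'] = ['2017-01', '2021-03', '2021-10', '2021-12']

# One column per character: shape (4, 50).
chars = dataresale['month'].copy()[:, None].view('U1')

# Re-view the first four characters as one 'U4' string, then parse as int.
year = np.ascontiguousarray(chars[:, :4]).view('U4').astype(int).ravel()
# Characters 5 and 6 hold the two-digit month.
month = np.ascontiguousarray(chars[:, 5:7]).view('U2').astype(int).ravel()

# Rows from the last quarter of 2021.
late_2021 = dataresale[(year == 2021) & (month >= 10)]
print(late_2021['month'])
```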

PS

The copy is really bothering me, since the original data is clearly "contiguous enough" to avoid it. If the elements of the string were not in a contiguous block, it would be understandable. I therefore propose the following alternative for string slicing, which is actually a lot cheaper and simpler in some ways:

yoffset = dataresale.dtype.fields['month'][1]
year = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=yoffset, strides=dataresale.strides, dtype='U4').astype(int)
moffset = dataresale.dtype.fields['month'][1] + dataresale.dtype.fields['month'][0].itemsize // 50 * 5
month = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=moffset, strides=dataresale.strides, dtype='U2').astype(int)

CodePudding user response:

I think you can do it with the help of pandas.Series as follows:

dataresale[pd.Series(dataresale['month']).str.match(r'^2021-')]