Python: Using only Numpy to process data by date ranges

Time:12-16

I am working on a final assignment for a Python data science course, and we are supposed to process sunspot data from 1700 to 2019. This involves processing basic data and developing visualizations for it using matplotlib. I asked the instructor about using Pandas, but we are only allowed to use the NumPy library for this project. We also have not learned about classes, so I assume those are also off limits.

Someone has asked and solved the entire problem at the following link. I looked at their solution for guidance (I don't need the whole thing solved, I just need to be pointed in the right direction), but it used Pandas. External Link to Assignment Posted on Chegg

The data comes in as a csv in the following form:

record,year,sunspots
1,1700,5
2,1701,11
3,1702,16
4,1703,23
...
316,2015,130
317,2016,133
318,2017,127.9
319,2018,144
320,2019,141

Based on the prompt, I believe the idea is to have the data read out as a complete table which looks something like:

        Min   Max   Total   Average  Stdev
18th C. 0     154   4544    45.44    35.79
19th C. 0     139   4218    42.18    33.35
20th C. 1     190   6256    61.56    46.28
21st C. 31    229   2451    122.55   53.05

Currently I have the data reading in correctly (I think) as follows:

# importing libraries
import numpy as np
import matplotlib.pyplot as mpl

# importing file and assigning relevant header information
sunspot_data = np.genfromtxt('project2_dataset.csv', delimiter=',', skip_header=False, dtype=str)
header = sunspot_data[0]
spot_data = sunspot_data[1:]

# indicating the data types and where they begin within the csv
record = spot_data[:, 0].astype(int)
year = spot_data[:, 1].astype(int)
num_spots = spot_data[:, 2].astype(float)

# creating the empty array and creating the arrays for the row and column headers
data_array = np.zeros((5, 6))
row_header = np.array(['','18th C.', '19th C.', '20th C.', '21st C.']).astype(str)
column_header = np.array(['','Minimum', 'Maximum', 'Total', 'Average', 'Standard Dev.']).astype(str)

The problem I am having is that I am running a 'for' loop to get the various values, but I cannot get them to store as an array to be able to populate a np.array. The code which I am currently using is as follows:

# defining the centuries within the data
cen18 = num_spots[0:100].astype(int)
cen19 = num_spots[100:200].astype(int)
cen20 = num_spots[200:300].astype(int)
cen21 = num_spots[300:].astype(int)

# creates a list of the centuries for processing
century_list = [cen18,cen19, cen20, cen21]

# for loop to get the descriptive statistics 
for lists in century_list:
    min_list = np.array(np.min(lists))
    max_list = np.array(np.max(lists))
    sum_list = np.array(np.sum(lists))
    mean_list = np.array(np.mean(lists))
    stdev_list = np.array(np.std(lists))

I am trying to get this to print correctly, but the following is the code I have written and what its output currently is.

in:

# attempt to insert the data within the array created above
data_array[1:,1] = min_list
data_array[1:,2] = max_list
data_array[1:,3] = sum_list
data_array[1:,4] = mean_list
data_array[1:,5] = stdev_list
print(data_array)

out:

[[   0.           0.           0.           0.           0.       0.        ]
 [   0.          33.         229.        2451.         122.55    53.05136662]
 [   0.          33.         229.        2451.         122.55    53.05136662]
 [   0.          33.         229.        2451.         122.55    53.05136662]
 [   0.          33.         229.        2451.         122.55    53.05136662]]

row 0 and col 0 should be headers as seen above, which is a whole different issue to solve...

So I guess my question is - how can I get that output to correctly go into a np.array, and when I move on to process the data on the decade-level, how can I do that efficiently without going through and creating a new variable for each decade?

CodePudding user response:

You could try this:

import io
import numpy as np

# your example data
a = np.genfromtxt(io.StringIO("""
1,1700,5
2,1701,11
3,1702,16
4,1703,23
316,2015,130
317,2016,133
318,2017,127.9
319,2018,144
320,2019,141"""), delimiter=',')[:, 1:].copy()

# a kind of groupby -- requires the centuries in the data to be contiguous
century, ix = np.unique(a[:, 0].astype(int) // 100, return_index=True)
out = np.c_[century + 1, [
    (v.min(), v.max(), v.sum(), v.mean(), v.std())
    for v in np.split(a[:,1], ix[1:])
]]

>>> out.round(3)
array([[ 18.   ,   5.   ,  23.   ,  55.   ,  13.75 ,   6.61 ],
       [ 21.   , 127.9  , 144.   , 675.9  , 135.18 ,   6.265]])

(That is read as: in the 18th century, min was 5, max was 23, total was 55, average was 13.75, stddev was 6.61).
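For the table in the question, here is a minimal sketch of the same statistics written row by row into a preallocated array and then printed with labels. The loop in the question rebinds min_list, max_list and so on at every pass, so only the last century survives; indexing the output row inside the loop avoids that. The sample values and the names `centuries`, `labels`, `columns` and `stats` are made up for illustration and are not from the assignment.

```python
import numpy as np

# Made-up sample values standing in for the per-century slices in the question
centuries = [
    np.array([5.0, 11.0, 16.0, 23.0]),
    np.array([130.0, 133.0, 127.9, 144.0, 141.0]),
]
labels = ["18th C.", "21st C."]
columns = ["Min", "Max", "Total", "Average", "Stdev"]

# One row per century, one column per statistic
stats = np.zeros((len(centuries), len(columns)))
for i, values in enumerate(centuries):
    # Assign into row i instead of rebinding a variable on every pass
    stats[i] = [values.min(), values.max(), values.sum(), values.mean(), values.std()]

# Print a plain-text table; the headers stay outside the numeric array
print(f"{'':8}" + "".join(f"{c:>10}" for c in columns))
for label, row in zip(labels, stats):
    print(f"{label:8}" + "".join(f"{x:>10.2f}" for x in row))
```

Keeping the row and column labels in separate lists means the numeric array stays purely float, which sidesteps the "headers in row 0 and col 0" problem.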

Important: the data needs to be ordered by year (to make sure each century group is contiguous). If it isn't, you need to sort it first.
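If the rows might not be in year order, one possible way to sort them first is with np.argsort on the year column. This is only a sketch; the array `a` below holds made-up (year, sunspots) rows, and `order` and `a_sorted` are illustrative names.

```python
import numpy as np

# Made-up rows of (year, sunspots), deliberately out of order
a = np.array([
    [2015.0, 130.0],
    [1701.0, 11.0],
    [2016.0, 133.0],
    [1700.0, 5.0],
])

# argsort on the year column gives the row order that sorts by year
order = np.argsort(a[:, 0], kind="stable")

# Fancy indexing with that order reorders whole rows, keeping year and value together
a_sorted = a[order]
```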

Source of inspiration & credits due to: This answer about groupby in numpy by Vincent J.
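For the decade-level part of the question, the same grouping should carry over by dividing the year by 10 instead of 100, so no per-decade variables are needed. A sketch under the same assumption that the rows are sorted by year; the sample rows and the names `decade`, `decade_stats` and `out` are made up for illustration.

```python
import numpy as np

# Made-up (year, sunspots) rows, already sorted by year
a = np.array([
    [1700.0, 5.0],
    [1701.0, 11.0],
    [1710.0, 3.0],
    [1711.0, 0.0],
    [1712.0, 2.0],
    [1720.0, 28.0],
])

years = a[:, 0].astype(int)

# Integer division by 10 maps 1700..1709 -> 170, 1710..1719 -> 171, and so on;
# ix holds the index of the first row of each decade
decade, ix = np.unique(years // 10, return_index=True)

# Split the sunspot column at the decade boundaries and summarise each piece
decade_stats = np.array([
    (v.min(), v.max(), v.sum(), v.mean(), v.std())
    for v in np.split(a[:, 1], ix[1:])
])

# First column is the starting year of each decade
out = np.c_[decade * 10, decade_stats]
```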
