I am working on a final assignment for a Python data science course, and we are supposed to process sunspot data from 1700 to 2019. The task is basic data processing plus visualizations built with matplotlib. I asked the instructor about using Pandas, but we are only allowed to use the NumPy library for this project. We also have not learned about classes, so I assume those are off limits too.
Someone has already asked about and solved the entire problem elsewhere (the assignment is posted on Chegg). I looked at their solution for guidance (I don't need the whole thing solved, I just need to be pointed in the right direction), but it used Pandas.
The data comes in as a csv in the following form:
record,year,sunspots
1,1700,5
2,1701,11
3,1702,16
4,1703,23
...
316,2015,130
317,2016,133
318,2017,127.9
319,2018,144
320,2019,141
Based on the prompt, I believe the idea is to have the data read out as a complete table which looks something like:
         Min  Max  Total  Average  Stdev
18th C.    0  154   4544    45.44  35.79
19th C.    0  139   4218    42.18  33.35
20th C.    1  190   6256    61.56  46.28
21st C.   31  229   2451   122.55  53.05
Currently I have the data reading in correctly (I think) as follows:
# importing libraries
import numpy as np
import matplotlib.pyplot as mpl
# importing file and assigning relevant header information
sunspot_data = np.genfromtxt('project2_dataset.csv', delimiter=',', skip_header=False, dtype=str)
header = sunspot_data[0]
spot_data = sunspot_data[1:]
# indicating the data types and where they begin within the csv
record = spot_data[:, 0].astype(int)
year = spot_data[:, 1].astype(int)
num_spots = spot_data[:, 2].astype(float)
# creating the empty array and creating the arrays for the row and column headers
data_array = np.zeros((5, 6))
row_header = np.array(['','18th C.', '19th C.', '20th C.', '21st C.']).astype(str)
column_header = np.array(['','Minimum', 'Maximum', 'Total', 'Average', 'Standard Dev.']).astype(str)
The problem I am having is that I am running a 'for' loop to get the various values, but I cannot get them to store as an array to be able to populate a np.array. The code which I am currently using is as follows:
# defining the centuries within the data
cen18 = num_spots[0:100].astype(int)
cen19 = num_spots[100:200].astype(int)
cen20 = num_spots[200:300].astype(int)
cen21 = num_spots[300:].astype(int)
# creates a list of the centuries for processing
century_list = [cen18,cen19, cen20, cen21]
# for loop to get the descriptive statistics
for lists in century_list:
    min_list = np.array(np.min(lists))
    max_list = np.array(np.max(lists))
    sum_list = np.array(np.sum(lists))
    mean_list = np.array(np.mean(lists))
    stdev_list = np.array(np.std(lists))
I am trying to get this to print correctly, but the following is the code I have written and what its output currently is.
in:
# attempt to insert the data within the array created above
data_array[1:,1] = min_list
data_array[1:,2] = max_list
data_array[1:,3] = sum_list
data_array[1:,4] = mean_list
data_array[1:,5] = stdev_list
print(data_array)
out:
[[ 0. 0. 0. 0. 0. 0. ]
[ 0. 33. 229. 2451. 122.55 53.05136662]
[ 0. 33. 229. 2451. 122.55 53.05136662]
[ 0. 33. 229. 2451. 122.55 53.05136662]
[ 0. 33. 229. 2451. 122.55 53.05136662]]
row 0 and col 0 should be headers as seen above, which is a whole different issue to solve...
So I guess my question is - how can I get that output to correctly go into a np.array, and when I move on to process the data on the decade-level, how can I do that efficiently without going through and creating a new variable for each decade?
CodePudding user response:
You could try this:
import io
import numpy as np
# your example data
a = np.genfromtxt(io.StringIO("""
1,1700,5
2,1701,11
3,1702,16
4,1703,23
316,2015,130
317,2016,133
318,2017,127.9
319,2018,144
320,2019,141"""), delimiter=',')[:, 1:].copy()
# a kind of groupby -- requires the centuries in the data to be contiguous
century, ix = np.unique(a[:, 0].astype(int) // 100, return_index=True)
out = np.c_[century + 1, [
    (v.min(), v.max(), v.sum(), v.mean(), v.std())
    for v in np.split(a[:, 1], ix[1:])
]]
>>> out.round(3)
array([[ 18. , 5. , 23. , 55. , 13.75 , 6.61 ],
[ 21. , 127.9 , 144. , 675.9 , 135.18 , 6.265]])
(That is read as: in the 18th century, min was 5, max was 23, total was 55, average was 13.75, stddev was 6.61).
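As an aside, the loop from the question can be kept as well. It only ever shows the 21st century because each pass overwrites min_list, max_list and so on, so nothing from the earlier centuries survives. Below is a minimal sketch that writes each century into its own row instead. The num_spots values are placeholders (not real sunspot counts) so the snippet runs without the CSV, and the labels are kept in plain lists because a numeric array cannot hold strings.

```python
import numpy as np

# Placeholder values, NOT real sunspot counts: one number per year, 1700-2019.
num_spots = np.arange(320, dtype=float)

# Same slicing as in the question (100 years per century, 20 for the 21st).
century_list = [num_spots[0:100], num_spots[100:200],
                num_spots[200:300], num_spots[300:]]

# One row per century, one column per statistic.
data_array = np.zeros((len(century_list), 5))

for i, values in enumerate(century_list):
    # Write into row i instead of overwriting a single variable each pass.
    data_array[i] = [np.min(values), np.max(values), np.sum(values),
                     np.mean(values), np.std(values)]

# Labels live outside the numeric array and are only joined in when printing.
row_labels = ['18th C.', '19th C.', '20th C.', '21st C.']
col_labels = ['Min', 'Max', 'Total', 'Average', 'Stdev']

print(f"{'':8}" + "".join(f"{c:>10}" for c in col_labels))
for label, row in zip(row_labels, data_array):
    print(f"{label:8}" + "".join(f"{x:>10.2f}" for x in row))
```

With the real data, replacing the placeholder line with the num_spots array read from the CSV should be the only change needed.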
Important: for the np.unique / np.split approach, the data needs to be ordered by year (to make sure each century group is contiguous). If it isn't, you need to sort it first.
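If the rows are not already in year order, the sort could look something like this. It is only a sketch: the array holds a few of the sample rows from the question, deliberately shuffled, with the year in column 0 and the sunspot count in column 1.

```python
import numpy as np

# A few sample rows from the question, deliberately shuffled.
# Columns: year, sunspots.
a = np.array([
    [2016, 133.0],
    [1701, 11.0],
    [2015, 130.0],
    [1700, 5.0],
    [1702, 16.0],
])

# argsort gives the row order that sorts the year column; a stable sort
# keeps rows with equal years in their original relative order.
order = np.argsort(a[:, 0], kind="stable")
a_sorted = a[order]
print(a_sorted)
```

Indexing with order moves whole rows, so each year stays paired with its own sunspot count.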
Source of inspiration and credit: this answer about groupby in numpy by Vincent J.
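For the decade-level part of the question, the same grouping idea should carry over by dividing the year by 10 instead of 100, so no per-decade variables are needed. The sketch below wraps it in a helper; group_stats and its width parameter are names made up for this illustration, not part of NumPy. It assumes the years are already sorted in ascending order.

```python
import numpy as np

def group_stats(years, values, width):
    """Hypothetical helper: min/max/sum/mean/std per block of `width` years.

    Assumes `years` is sorted ascending so that each block is contiguous.
    """
    keys = years.astype(int) // width
    # ix holds the position where each new block starts.
    labels, ix = np.unique(keys, return_index=True)
    stats = [(v.min(), v.max(), v.sum(), v.mean(), v.std())
             for v in np.split(values, ix[1:])]
    # First column is the first year of each block, e.g. 1700 or 2010.
    return np.c_[labels * width, stats]

# Sample rows from the question.
years = np.array([1700, 1701, 1702, 1703, 2015, 2016, 2017, 2018, 2019])
spots = np.array([5, 11, 16, 23, 130, 133, 127.9, 144, 141])

decades = group_stats(years, spots, 10)     # one row per decade
centuries = group_stats(years, spots, 100)  # one row per century
print(decades.round(3))
```

With only these nine sample rows, the decade and century groups happen to contain the same years, so the statistics match and only the first column differs (1700 and 2010 versus 1700 and 2000). On the full dataset there would be one row per decade.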