I have spent many hours attempting this question, including asking tutors and other peers for feedback. Nobody has been able to figure it out. I am quite desperate, as this assignment is due today and I need this code in order to test my other code. The question is as follows.
"The datatype of the array created by NumPy in Task 1 is unstructured. This is because, in the default setting, NumPy decides the datatype for each value. Also, the output in Task 1 contains the header row that may not be required in our assignment. So, remove the header row and convert all the columns to type float (i.e., "float") apart from the columns specified by the input parameter indexes (mentioned below). Also, the remaining columns which are not mentioned in indexes should be in Unicode of length 30 characters (i.e., "<U30"). Finally, every row is converted as a type tuple (e.g., tuple(i) for i in data).
Write a function data_type_format(data, indexes) that can complete the above-mentioned task, where, the input data is a NumPy array and indexes contains the column indices (in list) which are to be converted into <U30 data type, and the remaining columns in data which are not in indexes will be converted to type float."
This is the code I used in Task 1:
import numpy as np
import pandas as pd

def load_mydata(filename):
    "return the selected CSV columns as a '<U30' NumPy array"
    df = pd.read_csv(filename, delimiter=',', quotechar="", quoting=3, header=None)
    df = df.iloc[:, [0, 1, 2, 5, 8, 9, 10, 11, 12]]
    ndarray = np.array(df, dtype='U30')
    return ndarray
This is some of the code I have tried:
def data_type_format(data, indexes):
    "return"
    list = [float(element) if indx not in indexes else
            str(ord(element)) if len(element) < 30 else
            element for indx, element in data]
    return tuple(list)
import csv

def data_type_format(data, indexes):
    "return csv file data as tuple formatted"
    list = []
    df = pd.read_csv(data, delimiter=',', quotechar="", quoting=3, header=None)
    df = df.iloc[:, [0, 1, 2, 5, 8, 9, 10, 11, 12]]
    ndarray = np.array(df, dtype='U30')
    for df in data:
        if df in indexes:
            df[indexes] = df[indexes].astype(float)
        else:
            df[indexes] = df[indexes].astype('U30')
        list.append(df)
    return tuple(list)
The code needs to be tested in a Python shell with the following test:
data = load_mydata("location_review_data.csv")
data = data_type_format(data, [0, 1, 2, 3])
print(data[0])
# gives this result:
('Thu Jan 30 06:58:27 0000 2020', '98675', '1', '22847', 0.421, 0.442, 0.452, 0.397, 0.357)
I have tried many different versions and none of them are working! The CSV file used: https://drive.google.com/file/d/1Qd-xFaK4z5ijSHTmmtR7VhJyFaBzjFvO/view?usp=sharing
CodePudding user response:
With the downloaded csv I was able to get
In [181]: import pandas as pd
In [182]: df = pd.read_csv('../Downloads/location_review_data.csv')
In [183]: df
Out[183]:
created_at user_ID review_ID latitude longitude ... sad happy surprise disgust joy
0 Thu Jan 30 06:58:27 0000 2020 98675 1 30.235909 -97.795140 ... 0.421 0.442 0.452 0.397 0.357
1 Thu Jan 30 06:03:24 0000 2020 67730 2 30.269103 -97.749395 ... 0.469 0.408 0.488 0.377 0.350
2 Thu Jan 30 06:19:25 0000 2020 11576 3 30.255731 -97.763386 ... 0.542 0.361 0.276 0.270 0.424
3 Thu Jan 30 06:16:38 0000 2020 87911 4 30.263418 -97.757597 ... 0.418 0.499 0.352 0.367 0.291
4 Thu Jan 30 06:08:09 0000 2020 148147 5 30.274292 -97.740523 ... 0.242 0.632 0.532 0.501 0.199
... ... ... ... ... ... ... ... ... ... ... ...
19995 Fri Feb 07 23:47:25 0000 2020 21591 19996 42.357491 -71.058885 ... 0.437 0.431 0.609 0.446 0.260
19996 Fri Feb 07 23:21:33 0000 2020 37809 19997 42.357437 -71.058470 ... 0.432 0.412 0.549 0.452 0.219
19997 Fri Feb 07 23:34:48 0000 2020 39721 19998 42.357553 -71.057779 ... 0.597 0.314 0.311 0.304 0.428
19998 Sat Feb 08 00:18:22 0000 2020 20873 19999 42.357142 -71.058455 ... 0.452 0.479 0.518 0.442 0.252
19999 Fri Feb 07 23:13:02 0000 2020 33846 20000 42.357069 -71.058551 ... 0.391 0.490 0.548 0.459 0.199
[20000 rows x 13 columns]
A structured array from that:
In [184]: data = df.to_records()
In [185]: data.shape
Out[185]: (20000,)
In [186]: data.dtype
Out[186]: dtype((numpy.record, [('index', '<i8'), ('created_at', 'O'), ('user_ID', '<i8'), ('review_ID', '<i8'), ('latitude', '<f8'), ('longitude', '<f8'), ('location_ID', '<i8'), ('friend_count', 'O'), ('follower_count', 'O'), ('sad', '<f8'), ('happy', '<f8'), ('surprise', '<f8'), ('disgust', '<f8'), ('joy', '<f8')]))
The 'O' dtype fields contain strings as Python objects, which is standard pandas practice.
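If you want fixed-width unicode fields instead of the 'O' dtype, `to_records` accepts a `column_dtypes` mapping. A minimal sketch on a tiny illustrative frame (not the assignment data):

```python
import numpy as np
import pandas as pd

# small made-up frame standing in for the review data
df = pd.DataFrame({"created_at": ["Thu Jan 30 06:58:27 0000 2020"],
                   "sad": [0.421]})

# column_dtypes forces '<U30' for the string column instead of the default 'O'
rec = df.to_records(index=False, column_dtypes={"created_at": "<U30"})
print(rec.dtype)
```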
With numpy's csv reader:
In [192]: data = np.genfromtxt('../Downloads/location_review_data.csv', names=True, dtype=None, delimiter=',', encoding=None, comments=None)
In [193]: data.shape
Out[193]: (20000,)
In [194]: data.dtype
Out[194]: dtype([('created_at', '<U30'), ('user_ID', '<i8'), ('review_ID', '<i8'), ('latitude', '<f8'), ('longitude', '<f8'), ('location_ID', '<i8'), ('friend_count', '<U8'), ('follower_count', '<U22'), ('sad', '<f8'), ('happy', '<f8'), ('surprise', '<f8'), ('disgust', '<f8'), ('joy', '<f8')])
The '#' characters in some of the columns gave me problems, since '#' is the default comment character; I had to deactivate it with comments=None.
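To illustrate that pitfall in isolation, here is a minimal reproduction with a hypothetical two-row CSV: with the default `comments='#'`, everything after the first '#' in a row would be discarded, mangling the parse, while `comments=None` keeps the '#####' values intact.

```python
import io
import numpy as np

txt = "created_at,friend_count,sad\nrow1,#####,0.421\nrow2,###,0.469\n"

# comments=None disables comment stripping, so '#####' survives as a value
arr = np.genfromtxt(io.StringIO(txt), names=True, dtype=None,
                    delimiter=',', encoding=None, comments=None)
print(arr["friend_count"])
```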
The first 3 records:
In [195]: data[:3]
Out[195]:
array([('Thu Jan 30 06:58:27 0000 2020', 98675, 1, 30.23590912, -97.79513958, 22847, '#####', '############', 0.421, 0.442, 0.452, 0.397, 0.357),
('Thu Jan 30 06:03:24 0000 2020', 67730, 2, 30.26910295, -97.74939537, 420315, '#####', '############', 0.469, 0.408, 0.488, 0.377, 0.35 ),
('Thu Jan 30 06:19:25 0000 2020', 11576, 3, 30.25573099, -97.76338577, 316637, '#####', '##########', 0.542, 0.361, 0.276, 0.27 , 0.424)],
dtype=[('created_at', '<U30'), ('user_ID', '<i8'), ('review_ID', '<i8'), ('latitude', '<f8'), ('longitude', '<f8'), ('location_ID', '<i8'), ('friend_count', '<U8'), ('follower_count', '<U22'), ('sad', '<f8'), ('happy', '<f8'), ('surprise', '<f8'), ('disgust', '<f8'), ('joy', '<f8')])
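As for the `data_type_format` function itself, here is a minimal sketch operating on the unstructured '<U30' array that `load_mydata` returns. It assumes the header row is still the first row of `data`, and that a structured array (whose records print as tuples, matching the expected output) satisfies the checker; the field names `f0`, `f1`, … are arbitrary placeholders:

```python
import numpy as np

def data_type_format(data, indexes):
    """Drop the header row, keep the columns listed in `indexes` as '<U30',
    and convert every other column to float."""
    body = data[1:]                      # remove the header row
    n_cols = body.shape[1]
    # '<U30' for the listed columns, float for the rest
    dtype = [(f"f{i}", "<U30" if i in indexes else "float")
             for i in range(n_cols)]
    rows = [tuple(row) for row in body]  # every row as a tuple
    # NumPy casts the numeric strings to float when building the structured array
    return np.array(rows, dtype=dtype)
```

Called as `data_type_format(load_mydata("location_review_data.csv"), [0, 1, 2, 3])`, this should leave the first four columns as strings and the emotion scores as floats, as in the expected `print(data[0])` output.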