I was working with NumPy and Pandas to create some artificial data for testing models. First, I coded this:
# Constructing some random data for experiments
import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(42)
# Rectangular Data
total_n = 500
x = np.random.rand(total_n)*10
y = np.random.rand(total_n)*10
divider = 260
# Two lambda functions are for shifting the data, the numbers are chosen arbitrarily
f = lambda a: a*2
x[divider:] = f(x[divider:])
y[divider:] = f(y[divider:])
g = lambda a: a*3 + 5
x[:divider] = g(x[:divider])
y[:divider] = g(y[:divider])
# Colours array for separating the data
colors = ['blue']*divider + ['red']*(total_n-divider)
squares = np.array([x,y])
plt.scatter(squares[0],squares[1], c=colors, alpha=0.5)
I got what I wanted: [image: The Data I wanted]
But I wanted to include the colors array in the NumPy array, to use it as a label variable, so I changed the code to this:
# Constructing some random data for experiments
import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(42)
# Rectangular Data
total_n = 500
x = np.random.rand(total_n)*10
y = np.random.rand(total_n)*10
divider = 260
# Two lambda functions are for shifting the data, the numbers are chosen arbitrarily
f = lambda a: a*2
x[divider:] = f(x[divider:])
y[divider:] = f(y[divider:])
g = lambda a: a*3 + 5
x[:divider] = g(x[:divider])
y[:divider] = g(y[:divider])
# Colours array for separating the data
colors = ['blue']*divider + ['red']*(total_n-divider)
squares = np.array([x,y,colors])
plt.scatter(squares[0],squares[1], c=colors, alpha=0.5)
And the plot completely falls apart: [image: The Blown out Data]
I worked around this by keeping the labels separate from the NumPy array. But what is actually going on here?
Answer:
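Alright, so I think I have the answer. A NumPy array can hold only one data type, which is inferred when the array is created if you don't specify it. When you create squares with colors in it, NumPy has to find a single type that fits both the floats and the strings, so squares.dtype is '<U32'. That means every value, including the x and y coordinates, is converted to a little-endian Unicode string of up to 32 characters. plt.scatter then receives strings instead of numbers, which is why the plot no longer looks like the original.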
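The conversion is easy to check on a small example. The snippet below is only a minimal sketch with made-up values (the three-element x, y and colors are illustrative, not the data from the question); it shows the dtype before and after the string list is added, and that the numbers can be recovered with astype.

```python
import numpy as np

# Small illustrative inputs (hypothetical values, not the question's data)
x = np.array([1.5, 2.5, 3.5])
y = np.array([4.0, 5.0, 6.0])
colors = ['blue', 'red', 'red']

# Only floats: the array keeps a numeric dtype
numeric = np.array([x, y])
print(numeric.dtype)

# Floats mixed with strings: everything is converted to a Unicode string dtype
mixed = np.array([x, y, colors])
print(mixed.dtype)
print(mixed[0])  # the x values, now stored as strings

# The numbers are not lost, but they must be converted back explicitly
x_back = mixed[0].astype(float)
```

So the values are still there as text, but anything that expects numbers has to convert them back first.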
To avoid that, you can:
- use a plain Python list
- use a pandas DataFrame, since its columns can have different types
- if you want to stay with NumPy, use a structured array, as follows:
# the input must be a list of tuples, each representing one row; zip does the transformation
zipped = list(zip(x, y, colors))
# type of each field; 'U10' is a string of up to 10 characters
dtype = np.dtype([('x', float), ('y', float), ('colors', 'U10')])
# create the array, specifying the type explicitly
squares = np.array(zipped, dtype=dtype)
# when plotting, select each field by name, just as in a dataframe
plt.scatter(squares["x"], squares["y"], c=squares["colors"], alpha=0.5)
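For the pandas option mentioned in the list above, a rough sketch could look like this. The column names "x", "y" and "label", the sample size and the random values are all assumptions made for illustration; the point is only that each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical small data set; names and sizes are illustrative only
rng = np.random.default_rng(42)
n = 6
df = pd.DataFrame({
    "x": rng.random(n) * 10,
    "y": rng.random(n) * 10,
    "label": ["blue"] * 3 + ["red"] * 3,
})

# Each column keeps its own type: floats stay floats, labels stay strings
print(df.dtypes)

# Split into a purely numeric feature matrix and a separate label vector
features = df[["x", "y"]].to_numpy()
labels = df["label"].to_numpy()
```

Plotting would then be something like plt.scatter(df["x"], df["y"], c=df["label"], alpha=0.5), and features can be passed to a model without any string conversion getting in the way.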