matplotlib.pyplot.scatter not plotting normally when a new list is added to the array

I was working with NumPy and Pandas to create some artificial data for testing models. First, I coded this:

# Constructing some random data for experiments

import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(42)

# Rectangular Data

total_n = 500
x = np.random.rand(total_n)*10
y = np.random.rand(total_n)*10

divider = 260

# Two lambda functions are for shifting the data, the numbers are chosen arbitrarily
f = lambda a: a*2
x[divider:] = f(x[divider:])
y[divider:] = f(y[divider:])

g = lambda a: a*3 + 5
x[:divider] = g(x[:divider])
y[:divider] = g(y[:divider])

# Colours array for separating the data
colors = ['blue']*divider + ['red']*(total_n-divider)

squares = np.array([x,y])

plt.scatter(squares[0],squares[1], c=colors, alpha=0.5)

I got what I wanted: [image: The Data I wanted]

But I wanted to add the colors list to the NumPy array so it could serve as a label variable, so I changed the code to this:

# Constructing some random data for experiments

import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(42)

# Rectangular Data

total_n = 500
x = np.random.rand(total_n)*10
y = np.random.rand(total_n)*10

divider = 260

# Two lambda functions are for shifting the data, the numbers are chosen arbitrarily
f = lambda a: a*2
x[divider:] = f(x[divider:])
y[divider:] = f(y[divider:])

g = lambda a: a*3 + 5
x[:divider] = g(x[:divider])
y[:divider] = g(y[:divider])

# Colours array for separating the data
colors = ['blue']*divider + ['red']*(total_n-divider)

squares = np.array([x,y,colors])

plt.scatter(squares[0],squares[1], c=colors, alpha=0.5)

And everything just blows up: [image: The Blown out Data]

I worked around this by keeping the labels separate from the NumPy array. But still, what's going on here?

CodePudding user response：

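Alright, so I think I have the answer. A NumPy array can hold only one data type, which is inferred when the array is created unless you specify it. When you create squares with colors in it, NumPy has to find a single type that fits both the floats and the strings, so squares.dtype becomes '<U32': every value, including the x and y coordinates, is converted to a Unicode string of up to 32 characters (the '<' marks little-endian byte order). Matplotlib then receives strings instead of numbers for squares[0] and squares[1] and treats them as categories rather than positions on a numeric axis, which is why the plot looks wrong.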
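
To see the conversion happen, here is a minimal sketch with three points instead of 500. The variable names numeric, mixed and x_back are made up for this illustration and do not appear in the question.

```python
import numpy as np

x = np.array([1.5, 2.5, 3.5])
y = np.array([4.0, 5.0, 6.0])
colors = ['blue', 'red', 'red']

# Only floats: the array keeps a numeric dtype
numeric = np.array([x, y])
print(numeric.dtype)   # float64

# Floats mixed with strings: everything is coerced to a string dtype
mixed = np.array([x, y, colors])
print(mixed.dtype)     # a Unicode string dtype, '<U32' in the question
print(mixed[0])        # the x values, now stored as strings

# Converting a row back to float recovers the numbers for plotting
x_back = mixed[0].astype(float)
print(x_back.dtype)    # float64
```

Calling astype(float) on the coordinate rows would also make the original plot work, but it keeps the numbers stored as text, so the alternatives in this answer are usually cleaner.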

To avoid that you can:

use a plain Python list
use a pandas DataFrame, since its columns can have different types
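
As a rough sketch of the DataFrame option, assuming a much smaller data set than in the question: the column names x, y and label, and the use of default_rng, are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data (6 points instead of 500)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x": rng.random(6) * 10,
    "y": rng.random(6) * 10,
    "label": ["blue"] * 3 + ["red"] * 3,
})

# Each column keeps its own dtype: x and y stay float, label holds strings
print(df.dtypes)

# Rows can be selected by label without touching the numeric columns
red_points = df[df["label"] == "red"]
print(len(red_points))  # 3
```

For plotting, the columns can then be passed separately, for example df["x"] and df["y"] as coordinates and df["label"].tolist() as the c argument of plt.scatter.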

If you want to stay with NumPy, you can use a structured array as follows:

# The input must be a list of tuples, one tuple per row; zip builds those rows
zipped = list(zip(x, y, colors))

# Field types: two floats and a string of up to 10 characters (U10)
dtype = np.dtype([('x', float), ('y', float), ('colors', 'U10')])

# Create the array, specifying the dtype explicitly
squares = np.array(zipped, dtype=dtype)

# When plotting, select each field by name, just as with a DataFrame column
plt.scatter(squares["x"], squares["y"], c=squares["colors"], alpha=0.5)
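
To check that the structured array really keeps the coordinates numeric, here is a small self-contained sketch with four hand-picked points. The sample values and the name red are assumptions for this example, not part of the question's data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 7.0, 8.0])
colors = ['blue', 'blue', 'red', 'red']

dtype = np.dtype([('x', float), ('y', float), ('colors', 'U10')])
squares = np.array(list(zip(x, y, colors)), dtype=dtype)

# Each field has its own dtype, so x and y are still floats
print(squares['x'].dtype)   # float64
print(squares['colors'])    # the labels, stored as strings

# The label field can be used as a mask to select rows
red = squares[squares['colors'] == 'red']
print(red['x'].mean())      # 3.5
```

Note that a structured array is one-dimensional with named fields, so squares.shape here is (4,) rather than (3, 4), and squares[0] returns the first row, not the x values.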