I need help with the following code. My code and the desired output are below.
Background on the function I'm building:
It should take a 2D NumPy array as input. Here is a sample of the array:
array([[ 1960, 54211],
[ 1961, 55438],
[ 1962, 56225],
[ 1963, 56695],
[ 1964, 57032],
[ 1965, 57360],
[ 1966, 57715]...
It should split the array such that X is the year and y is the corresponding population, and return two tuples of the form (X_train, y_train), (X_test, y_test). It should use sklearn's train_test_split function with test_size = 0.2 and random_state = 42.
My current code:
def f_r_split(arr):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
    return (X_train, y_train), (X_test, y_test)
Input code (not changeable):
data = get_year_pop('Aruba')
f_r_split(data)
Current output:
((array([ 83200, 64622, 58386, 60366, 57715, 57032, 92898, 59980,
62149, 101453, 101669, 103795, 60657, 58726, 61833, 62644,
60586, 62836, 72504, 104341, 90853, 59440, 68235, 104822,
97017, 61032, 103187, 55438, 60567, 56225, 100031, 89005,
80324, 62201, 101220, 59063, 61345, 60103, 105264, 60096,
58055, 94992, 60528, 61079, 102053, 87277]),
array([ 83200, 64622, 58386, 60366, 57715, 57032, 92898, 59980,
62149, 101453, 101669, 103795, 60657, 58726, 61833, 62644,
60586, 62836, 72504, 104341, 90853, 59440, 68235, 104822,
97017, 61032, 103187, 55438, 60567, 56225, 100031, 89005,
80324, 62201, 101220, 59063, 61345, 60103, 105264, 60096,
58055, 94992, 60528, 61079, 102053, 87277])),
(array([ 54211, 57360, 76700, 60243, 98737, 102577, 85451, 63026,
100832, 59840, 101353, 56695]),
array([ 54211, 57360, 76700, 60243, 98737, 102577, 85451, 63026,
100832, 59840, 101353, 56695])))
Desired output:
X_train == array([1996, 1991, 1968, 1977, 1966, 1964, 2001, 1979, 1990, 2009, 2010,
2014, 1975, 1969, 1987, 1986, 1976, 1984, 1993, 2015, 2000, 1971,
1992, 2016, 2003, 1989, 2013, 1961, 1981, 1962, 2005, 1999, 1995,
1983, 2007, 1970, 1982, 1978, 2017, 1980, 1967, 2002, 1974, 1988,
2011, 1998])
y_train == array([ 83200, 64622, 58386, 60366, 57715, 57032, 92898, 59980,
62149, 101453, 101669, 103795, 60657, 58726, 61833, 62644,
60586, 62836, 72504, 104341, 90853, 59440, 68235, 104822,
97017, 61032, 103187, 55438, 60567, 56225, 100031, 89005,
80324, 62201, 101220, 59063, 61345, 60103, 105264, 60096,
58055, 94992, 60528, 61079, 102053, 87277])
X_test == array([1960, 1965, 1994, 1973, 2004, 2012, 1997, 1985, 2006, 1972, 2008,
1963])
y_test == array([ 54211, 57360, 76700, 60243, 98737, 102577, 85451, 63026,
100832, 59840, 101353, 56695])
Answer:
You are assigning the same vector to both X and y, because [:, -1] means you are taking the last column. That is why both arrays in each output tuple contain the population values.
The first line of the function should be:
X, y = arr[:, 0], arr[:, 1]
Alternatively, since the array has exactly two columns, you can unpack its transpose:
X, y = arr.T
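Putting it together, here is a minimal sketch of the whole function with the corrected first line. It assumes the input always has the year in column 0 and the population in column 1. Because get_year_pop is not shown in the question, the sample array below is a stand-in built from the seven rows quoted there.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def f_r_split(arr):
    # Column 0 holds the year (X), column 1 the population (y).
    X, y = arr[:, 0], arr[:, 1]
    # Split both columns with the same shuffle so each year stays
    # paired with its population.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    return (X_train, y_train), (X_test, y_test)


# Stand-in for get_year_pop('Aruba'): the seven sample rows from the question.
sample = np.array([
    [1960, 54211],
    [1961, 55438],
    [1962, 56225],
    [1963, 56695],
    [1964, 57032],
    [1965, 57360],
    [1966, 57715],
])

(X_train, y_train), (X_test, y_test) = f_r_split(sample)
print(X_train, y_train)
print(X_test, y_test)
```

With this change X_train and X_test hold years, y_train and y_test hold populations, and each year still lines up with its own population after the shuffle.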