Specific output format for a numpy array that needs to be split for train_test_split

Time:08-03

I need help with the following code. My current code and my desired output are both shown below.

Background on the function I'm building:

Should take a 2D NumPy array as input. Here is a sample of the array:

array([[  1960,  54211],
       [  1961,  55438],
       [  1962,  56225],
       [  1963,  56695],
       [  1964,  57032],
       [  1965,  57360],
       [  1966,  57715]...

Should split the array such that X is the year, and y is the corresponding population. Should return two tuples of the form (X_train, y_train), (X_test, y_test). Should use sklearn's train_test_split function with a test_size = 0.2 and random_state = 42.

My current code:

def f_r_split(arr):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
    return (X_train, y_train), (X_test, y_test)

Input code (not changeable):

data = get_year_pop('Aruba')
f_r_split(data)

Current output:

((array([ 83200,  64622,  58386,  60366,  57715,  57032,  92898,  59980,
          62149, 101453, 101669, 103795,  60657,  58726,  61833,  62644,
          60586,  62836,  72504, 104341,  90853,  59440,  68235, 104822,
          97017,  61032, 103187,  55438,  60567,  56225, 100031,  89005,
          80324,  62201, 101220,  59063,  61345,  60103, 105264,  60096,
          58055,  94992,  60528,  61079, 102053,  87277]),
  array([ 83200,  64622,  58386,  60366,  57715,  57032,  92898,  59980,
          62149, 101453, 101669, 103795,  60657,  58726,  61833,  62644,
          60586,  62836,  72504, 104341,  90853,  59440,  68235, 104822,
          97017,  61032, 103187,  55438,  60567,  56225, 100031,  89005,
          80324,  62201, 101220,  59063,  61345,  60103, 105264,  60096,
          58055,  94992,  60528,  61079, 102053,  87277])),
 (array([ 54211,  57360,  76700,  60243,  98737, 102577,  85451,  63026,
         100832,  59840, 101353,  56695]),
  array([ 54211,  57360,  76700,  60243,  98737, 102577,  85451,  63026,
         100832,  59840, 101353,  56695])))

Desired output:

X_train == array([1996, 1991, 1968, 1977, 1966, 1964, 2001, 1979, 1990, 2009, 2010,
       2014, 1975, 1969, 1987, 1986, 1976, 1984, 1993, 2015, 2000, 1971,
       1992, 2016, 2003, 1989, 2013, 1961, 1981, 1962, 2005, 1999, 1995,
       1983, 2007, 1970, 1982, 1978, 2017, 1980, 1967, 2002, 1974, 1988,
       2011, 1998])

y_train == array([ 83200,  64622,  58386,  60366,  57715,  57032,  92898,  59980,
        62149, 101453, 101669, 103795,  60657,  58726,  61833,  62644,
        60586,  62836,  72504, 104341,  90853,  59440,  68235, 104822,
        97017,  61032, 103187,  55438,  60567,  56225, 100031,  89005,
        80324,  62201, 101220,  59063,  61345,  60103, 105264,  60096,
        58055,  94992,  60528,  61079, 102053,  87277])

X_test == array([1960, 1965, 1994, 1973, 2004, 2012, 1997, 1985, 2006, 1972, 2008,
       1963])

y_test == array([ 54211,  57360,  76700,  60243,  98737, 102577,  85451,  63026,
       100832,  59840, 101353,  56695])

CodePudding user response:

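You are assigning the same vector to both X and y: indexing with [:, -1] selects the last column (the population), so the year column is never used. That is why both arrays in each returned tuple are identical.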

The first line of the function should be:

X, y = arr[:, 0], arr[:, 1]

Or, alternatively, transpose the array and unpack its two rows:

X, y = arr.T
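
Putting it together, a minimal sketch of the whole function might look like the following. It assumes the year is in column 0 and the population in column 1, as in the sample array from the question. The `sample` array is made-up data standing in for `get_year_pop('Aruba')`, which is not shown in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def f_r_split(arr):
    # Assumption: column 0 is the year (X), column 1 is the population (y).
    X, y = arr[:, 0], arr[:, 1]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    return (X_train, y_train), (X_test, y_test)


# Hypothetical stand-in for get_year_pop('Aruba'): 10 rows of made-up data
# where population = (year - 1960) * 100, so the pairing is easy to check.
years = np.arange(1960, 1970)
sample = np.column_stack([years, (years - 1960) * 100])

(X_train, y_train), (X_test, y_test) = f_r_split(sample)
print(X_train, y_train)
print(X_test, y_test)
```

With 10 rows and test_size = 0.2, this gives 8 training rows and 2 test rows, and each year stays paired with its own population because train_test_split shuffles X and y with the same indices.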