Home > Back-end >  How to build a Pandas Dataframe with a Numpy Array from Imported CSV data with multiple numbers
How to build a Pandas Dataframe with a Numpy Array from Imported CSV data with multiple numbers

Time:01-06

I'm a little stumped on this one. I've created a proof of concept where I built a Pandas Dataframe with a static Numpy Array of numbers. I got this working fine, but now I'm taking it a step further and importing a CSV file to build this same Dataframe and Numpy Array. Here is the snippet of the file and what I've written. I want to take the second column of 'numbers' and build an array of 6 numbers per line. For example, [[11],[21],[27],[36],[62],[24]], [[14],[18],[36],[49],[67],[18]], etc.

CSV:

date,numbers,multiplier
09/26/2020,11 21 27 36 62 24,3
09/30/2020,14 18 36 49 67 18,2
10/03/2020,18 31 36 43 47 20,2

CODE:

data = pd.read_csv('pbhistory.csv')
data['date'] = pd.to_datetime(data.date, infer_datetime_format=True)
data.sort_values(by='date', ascending=True, inplace=True)
df = pd.DataFrame(data.numbers).to_numpy()
df2 = pd.DataFrame(df, columns=['1', '2', '3', '4', '5', '6'])
print(df2.head())

ERROR: I'm expecting 6 columns of data from df2 as I thought it was converted to an array properly after importing the 'numbers' column from the CSV, but I get the following:

ValueError: Shape of passed values is (1414, 1), indices imply (1414, 6)

So, I change the code to df2 = pd.DataFrame(df, columns=['1']) and get the following output. The problem is, I need it to be in 6 columns, not 1.

                   1
0  11 21 27 36 62 24
1  14 18 36 49 67 18
2  18 31 36 43 47 20

So, as you can see, I'm only getting one column with all numbers, instead of an array of numbers with 6 columns.

CodePudding user response:

Remember that CSV stands for Comma Separated Values, ie it reads everything between two commas as one input. If you want the numbers separated you have to put commas between them, otherwise you'll have to parse the longer text of 6 non-comma separated values and rebuild the dataframe.

CodePudding user response:

data = pd.read_csv('pbhistory.csv')
data['date'] = pd.to_datetime(data.date, infer_datetime_format=True)
data.sort_values(by='date', ascending=True, inplace=True)
df = pd.DataFrame(data.numbers).to_numpy()

Then split it first

df2 = df['numbers'].str.split(' ', expand=True)
  • Related