I am importing the CSV file from here: https://raw.githubusercontent.com/kwartler/Harvard_DataMining_Business_Student/master/BookDataSets/LaptopSales.csv
This code works:
from dfply import *
import pandas as pd
df = pd.read_csv("LaptopSales.csv")
(df >> select(X["Date"]) >> mutate(AdjDate = (X.Date.str.split(" "))) >> head(3))
and produces this result:
               Date              AdjDate
0  01-01-2008 00:01  [01-01-2008, 00:01]
1  01-01-2008 00:02  [01-01-2008, 00:02]
2  01-01-2008 00:04  [01-01-2008, 00:04]
But when I try to extract the first element in the list:
from dfply import *
import pandas as pd
df = pd.read_csv("LaptopSales.csv")
(df >> select(X["Date"]) >> mutate(AdjDate = (X.Date.str.split(" ")[0])) >> head(3))
I get a wall of errors culminating in:
ValueError: Length of values (2) does not match length of index (279999)
CodePudding user response:
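AdjDate = (X.Date.str.split(" ")[0])
does not take the first element of each row's list. The [0] is applied to the whole Series returned by split, so it selects the row with index label 0, which is the 2-element list [01-01-2008, 00:01].
pandas then tries to assign that 2-element list to a column with 279999 rows, and raises the error because the lengths do not match.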
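To illustrate the difference between indexing the Series itself and indexing inside each row's list, here is a minimal sketch in plain pandas. The three-row frame is a made-up stand-in for the Date column of LaptopSales.csv, and the variable names are only illustrative.

```python
import pandas as pd

# Illustrative stand-in for the Date column of LaptopSales.csv (assumed values).
df = pd.DataFrame(
    {"Date": ["01-01-2008 00:01", "01-01-2008 00:02", "01-01-2008 00:04"]}
)

parts = df["Date"].str.split(" ")

# Plain [0] looks up index label 0, so it returns the first row's list.
first_row = parts[0]

# .str[0] indexes into each row's list and returns a Series as long as df.
first_of_each = parts.str[0]

df["AdjDate"] = first_of_each
print(first_row)
print(df)
```

The same .str[0] accessor can presumably be chained inside the dfply expression, as in mutate(AdjDate = X.Date.str.split(" ").str[0]), but that form is an assumption and has not been tested here.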
CodePudding user response:
The answer is that one of the rows in the CSV file has a NaN value in the Date column. That value can't be split on " ". NaN is a float, so the split does not produce a list for that row, and the indexing operation then fails. It's row 2913 in the CSV file: ",51,SE14 6LA,SE8 3JD,460,15,4,2,1.5,Yes,80,Yes,536682,177068,537175,177885"
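A quick way to check for missing Date values, and to leave them out before splitting, might look like the sketch below. It uses a small made-up frame rather than the real CSV, and the names missing and clean are only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing Date, standing in for the real CSV.
df = pd.DataFrame({"Date": ["01-01-2008 00:01", np.nan, "01-01-2008 00:04"]})

# Locate the rows whose Date is missing.
missing = df[df["Date"].isna()]
print(missing.index.tolist())

# Drop those rows, then take the date part of each remaining value.
clean = df.dropna(subset=["Date"]).copy()
clean["AdjDate"] = clean["Date"].str.split(" ").str[0]
print(clean)
```

Whether to drop such rows or keep them with a missing AdjDate depends on the analysis; dropping is just the simplest choice for this sketch.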
The reason I didn't simply delete the question is that the data set is publicly available and appears to be part of a course offered through Harvard University: https://github.com/kwartler/Harvard_DataMining_Business_Student