I have a very large dataset (>1M entries) and a list of postcodes. I want to loop through the postcodes and build a list of the matching output area codes from the dataset.
The dataset source: https://geoportal.statistics.gov.uk/datasets/06938ffe68de49de98709b0c2ea7c21a/about
The code:
import dask.dataframe as dd
df = dd.read_csv("PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv")
zipcodes = ["AB1 5YP", "AB1 7FH"]
oa11cd_output = []
for zipcode in zipcodes:
    entry = df[df['pcds'] == zipcode]
    oa11cd_output.append(entry['oa11cd'])
However, when I try to even print the entry, I do not get the actual row content but something that looks like this:
Dask Series Structure:
npartitions=6
    object
       ...
       ...
       ...
       ...
Name: oa11cd, dtype: object
Dask Name: getitem, 5 graph layers
dd.Scalar<size-ag..., dtype=int32>
Any idea how to get the actual content? Thank you
CodePudding user response:
The encoding seems to be "iso-8859-1". On top of that, type inference fails for two of the columns in this particular file, so you have to force their dtypes. Finally, Dask is lazy: filtering only builds a task graph, which is why your print showed the Series structure rather than its content; you need to call .compute() to materialize the actual rows. See the code below:
import dask.dataframe as dd

# Force the two dtypes Dask cannot infer, and read with the right encoding
df = dd.read_csv("PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv",
                 dtype={'doterm': 'float64', 'ladnmw': 'object'},
                 encoding="iso-8859-1")

zipcodes = ["AB1 5YP", "AB1 7FH"]
oa11cd_output = []
for zipcode in zipcodes:
    # .compute() turns the lazy selection into a concrete pandas object
    entry = df[df['pcds'] == zipcode].compute().to_dict()
    oa11cd_output.append(entry['oa11cd'])
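Note that DataFrame.to_dict() is keyed by column, so each appended entry['oa11cd'] is itself a dict of {row_index: value}. Also, every .compute() call triggers a separate pass over the CSV. If you just want the codes as a flat list, a single-pass variant like this should work (a sketch against the same file, untested):

# Filter all postcodes in one pass, then materialize just the needed columns
matches = df[df['pcds'].isin(zipcodes)][['pcds', 'oa11cd']].compute()
oa11cd_output = matches['oa11cd'].tolist()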
Some explanation of how I determined the encoding. The first attempt was:
> file -i PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv
PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv: text/csv; charset=us-ascii
OK, so file says it is us-ascii. But file does not read the whole file to make that evaluation (check the -P parameter). I tried to increase how much file reads, but the process ran out of memory.
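Raising that limit looks something like this (the byte count here is arbitrary, and -P requires a reasonably recent version of file):
> file -P bytes=500000000 -i PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv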
Let's try to convert the file instead:
> iconv -f US-ASCII -t UTF-8 PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv > converted.csv
iconv: illegal input sequence at position 198311200
OK, so it is clearly not us-ascii. Let's output a few lines starting at that position:
> tail -c +198311200 PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv | head -n 10 > problem_lines.txt
> file -i problem_lines.txt
problem_lines.txt: text/plain; charset=iso-8859-1
Problem solved!
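For completeness, the same check can be done from Python instead of the shell. A minimal sketch, assuming the third-party chardet package is installed (the window size around the offset is arbitrary):

import chardet

# Byte offset of the illegal sequence reported by iconv
OFFSET = 198311200

with open("PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv", "rb") as f:
    f.seek(max(OFFSET - 512, 0))  # read a small window around the offset
    window = f.read(1024)

# Returns a guess such as {'encoding': 'ISO-8859-1', 'confidence': ...}
print(chardet.detect(window))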