How can I see a list of the variables in a CSV column?


I have a csv file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):

Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645

I would like some code that lists the unique values in a specific column. For example, it should return {Kish, Qeshm, Tabriz} for the 'City' column.

CodePudding user response:

First import the csv module, then read over each row in the file and save the city from each row in a list, like this:

import csv

cities = []
with open("yourfile.csv", "r") as file:
    reader = csv.DictReader(file)  # DictReader treats the first line as the header, so each row becomes a dict keyed by column name
    for row in reader:
        city = row["City"]
        cities.append(city)

This will give you a list like cities = ['Kish', 'Qeshm', 'Tabriz', ...], including duplicates.
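
If you only need the distinct city names, you can wrap the finished list in set() (a small sketch using the cities list built above):

unique_cities = set(cities)  # drops duplicates; a set does not preserve order
print(unique_cities)         # e.g. {'Kish', 'Qeshm', 'Tabriz'}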

CodePudding user response:

It appears you want to remove duplicates as well, which you can do by simply casting the finished list to a set. Here's how to do it with pandas:

import pandas as pd
cities = pd.read_csv('yourfile.csv', usecols=['City'])['City']

# cast to list if you want a plain Python list instead of a pandas Series
cities_list = list(cities)

# use set to remove the duplicates
unique_cities = set(cities)

If you need to preserve ordering, you can use a dict with just keys (dicts keep insertion order in Python 3.7+).
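
For example, a minimal sketch using the cities_list from above:

# keeps the first occurrence of each city, in the order they appear in the file
unique_cities_ordered = list(dict.fromkeys(cities_list))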

Also, if you're short on memory when reading 5M rows in one go, you can read them in chunks:

import pandas as pd
cities_chunks_list = [chunk['City'] for chunk in pd.read_csv('yourfile.csv', usecols=['City'], chunksize=1000)]

# flatten the list of per-chunk Series into a single list of cities
cities_list = [city for cities_chunk in cities_chunks_list for city in cities_chunk]
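
If all you want is the unique values, you can also build a set incrementally while reading the chunks, so the whole column never has to sit in memory at once (a sketch, assuming the same 'yourfile.csv' and 'City' column):

import pandas as pd

unique_cities = set()
for chunk in pd.read_csv('yourfile.csv', usecols=['City'], chunksize=1000):
    unique_cities.update(chunk['City'].dropna())  # add each chunk's cities to the set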

Hope I helped.
