I am using a dask dataframe to read a CSV file which is quite large. I want to extract some specific columns

Time:05-09

I have a CSV file of about 3 GB that I want to read with dask. I want to perform an operation on this data: select the rows where a column contains specific values.

For example:

I want to get all the rows in df whose id is in this list:

ids = ['SW00003062', 'SW00003063', 'SW00003067', 'SW00003072']

from this dask dataframe:

[Image: dask dataframe]

Simply put, I want the part of the dataframe whose id is in the ids list.

CodePudding user response:

What about this?

import pandas

random_name = pandas.read_csv("insert file name")  # load the CSV into a DataFrame
random_name["column title"]  # this should give you your column of choice
column_values = random_name["column title"].to_list()  # turns the column into a list
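Since the file in the question is about 3 GB, loading it in one call may not fit comfortably in memory. A minimal pandas-only sketch, assuming the column is named "id" (the question does not show the real column name) and using a tiny temporary CSV as a stand-in for the real file, reads the data in chunks and keeps only the matching rows:

```python
import os
import tempfile

import pandas as pd

# IDs to keep, taken from the question.
ids = ['SW00003062', 'SW00003063', 'SW00003067', 'SW00003072']

# Tiny stand-in for the large file; the column names "id" and "value" are assumptions.
sample = pd.DataFrame({
    "id": ['SW00003061', 'SW00003062', 'SW00003063', 'SW00003070', 'SW00003072'],
    "value": [1, 2, 3, 4, 5],
})
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
sample.to_csv(csv_path, index=False)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
chunks = pd.read_csv(csv_path, chunksize=2)

# Filter each chunk with isin, then stitch the surviving rows together.
filtered = pd.concat([chunk[chunk["id"].isin(ids)] for chunk in chunks])
print(filtered)
```

For a real file, a much larger chunksize (for example 100000 rows) would be a more sensible starting point.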

CodePudding user response:

Dask's syntax is very similar to pandas, which means most pandas commands are supported. For your requirement, you can do the following:

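import pandas as pd
import dask.dataframe as dd
from numpy import random  # numpy's random module, not the standard library one

# build a small example dataframe
df = pd.DataFrame(
    {
        'size': random.random(size=100),
        'label': random.choice(['a', 'b', 'c'], size=100, replace=True)
    }
)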

dask_df = dd.from_pandas(df, npartitions=2)

# to get the unique values from a column
dask_df.label.unique().compute()

# to select rows based on a condition on a column
dask_df.loc[dask_df.label == 'a', :].compute()

Note that dask stores each operation as a task; to actually execute those tasks, you have to call compute() on the expression.
