Trying to filter a CSV file with multiple variables using pandas in python

Time:08-10

import pandas as pd
import numpy as np
df = pd.read_csv("adult.data.csv")

print("data shape: " + str(data.shape))
print("number of rows: " + str(data.shape[0]))
print("number of cols: " + str(data.shape[1]))
print(data.columns.values)

datahist = {}
for index, row in data.iterrows():
    k = (str(row['age']) + str(row['sex']) +
         str(row['workclass']) + str(row['education']) +
         str(row['marital-status']) + str(row['race']))
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1
uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)

for key, value in datahist.items():
    if value == 1: 
        print(key)

df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]

I have been trying to get the above code to work.

I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.

Any assistance will be much appreciated!

CodePudding user response:

df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]

First, you are using Male as a variable name; you probably meant the string 'Male'. Second, look at the placement of [ and ]: you extract the part of the DataFrame where age equals 58, then the part where sex equals Male, and then try to apply bitwise and to those two pieces. You should probably apply & to the conditions rather than to pieces of the DataFrame, that is:

df.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
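To see why combining the conditions works, here is a minimal sketch on a small made-up frame. The rows are hypothetical stand-ins for adult.data.csv, and the names age_mask, sex_mask, both and filtered are only for illustration. Each comparison produces a boolean Series, & combines them element-wise, and .loc keeps the rows where the result is True.

```python
import pandas as pd

# Hypothetical sample rows standing in for adult.data.csv
data = pd.DataFrame({
    "age": [58, 58, 39, 58],
    "sex": ["Male", "Female", "Male", "Male"],
    "workclass": ["Private", "Private", "State-gov", "Self-emp"],
})

# Each comparison yields a boolean Series aligned with the rows
age_mask = data["age"] == 58
sex_mask = data["sex"] == "Male"

# & combines the two masks element-wise
both = age_mask & sex_mask

# .loc with a boolean mask keeps only the rows marked True
filtered = data.loc[both]
print(filtered)
```

Naming the masks separately is optional; it just makes it easier to inspect each condition on its own before combining them.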

CodePudding user response:

The int64 columns work just fine because you've specified the condition correctly as:

data['age'] == 58

However, the object column condition data['sex'] == Male should be specified as a string:

data['sex'] == 'Male'
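As a small illustration of the difference, the sketch below uses a two-row made-up frame (not the real file). Without quotes, Python looks for a variable called Male and fails before pandas is involved at all; with quotes, the comparison runs against the string values in the column.

```python
import pandas as pd

# Hypothetical two-row frame, only for demonstration
data = pd.DataFrame({"age": [58, 39], "sex": ["Male", "Female"]})

# Unquoted Male is treated as a variable name that was never defined
try:
    data["sex"] == Male
except NameError as err:
    error_message = str(err)
    print(error_message)

# Quoted 'Male' is a string literal, so the comparison works
matches = data["sex"] == "Male"
print(matches.tolist())
```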

Also, I noticed that you loaded the dataframe as df = pd.read_csv("adult.data.csv"), while the rest of the code refers to data. Did you mean this instead?

data = pd.read_csv("adult.data.csv")

The query at the end includes two conditions, and each one should be enclosed in parentheses inside the square-bracket [ ] filter. If the dataframe name is data (instead of df), it should be:

data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
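If the parentheses feel error-prone, one alternative (not something the question requires) is DataFrame.query, which takes the whole filter as a string. The sketch below uses made-up rows and assumes the frame is named data; it checks that both forms select the same rows.

```python
import pandas as pd

# Hypothetical sample rows, assuming the frame is named data
data = pd.DataFrame({
    "age": [58, 58, 39],
    "sex": ["Male", "Female", "Male"],
})

# Boolean-mask form: each condition wrapped in parentheses
with_loc = data.loc[(data["age"] == 58) & (data["sex"] == "Male")]

# query form: the conditions are written inside one string
with_query = data.query("age == 58 and sex == 'Male'")

print(with_loc.equals(with_query))
```

Note that inside query the string value still needs its own quotes, for the same reason as above.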