MAX RETURN QUESTION MARK IN PYTHON (PANDAS)

Time:10-23

I'm having a problem with my code

            data = pd.read_table('household_power_consumption.txt',sep=';', 
                                  low_memory=False,header=0, index_col=False,
                                  parse_dates=[0])
            df = pd.DataFrame(data, dtype=None)
            col = df["Global_active_power"]
            max_value=col.max()
            print(max_value)

This is an image of the dataset: [image]

As you can see, the column "Global_active_power" is fully populated with data. However, max() returns a question mark ("?").

I have tried several approaches, but the value stays the same. Can somebody help me with this?


You can get the data from https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption

CodePudding user response:

Probably not all of the rows in the column Global_active_power hold numerical values. Some rows apparently contain "?" in place of a missing value. Because of that, the entire column is read as strings rather than numbers, and max() compares the strings lexicographically. "?" sorts after every digit, so it is returned as the maximum.

Check example:

import pandas as pd

df = pd.DataFrame({"x": [1,2,3], "y": ["32.32", "?","fef"], "z": ["32.32", "456", "?"]})

df

# output
   x      y      z
0  1  32.32  32.32
1  2      ?    456
2  3    fef      ?


df.x.max()

# output
3

df.y.max()

# output
'fef'

df.z.max()

# output
'?'

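If a column is not of a numerical data type, max() returns the lexicographically largest string, not the largest number. That is why y gives 'fef' ("f" sorts after "?") and z gives '?' ("?" sorts after every digit).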
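To see why '?' wins in the z column while 'fef' wins in the y column, here is a small sketch using plain Python strings, without pandas. The list `values` is made up for illustration and mirrors the z column above; it relies only on the fact that Python compares strings character by character using their code points.

```python
# Illustrative values mirroring column "z" in the example above
values = ["32.32", "456", "?"]

# Code points decide the order: "." < digits < "?" < lowercase letters
code_points = {ch: ord(ch) for ch in ".09?f"}
print(code_points)

# String comparison: "?" beats every string that starts with a digit
string_max = max(values)

# Numeric comparison: skip the placeholder and convert the rest to float
numeric_max = max(float(v) for v in values if v != "?")

print(string_max, numeric_max)
```

The same ordering applies when pandas calls max() on a column of strings.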

CodePudding user response:

The data in the column, as imported, is str type -- so .max() isn't meaningful in the sense you intend. It appears that the data is floating-type, so you need to convert it to type float64, but first replace all the ? values with NaN (so the type conversion doesn't fail). That is, try:

col = (df['Global_active_power']
       .apply(lambda x: x if x != '?' else 'NaN')
       .astype('float64'))

Working example:

import pandas as pd

data = pd.read_table('household_power_consumption.txt',
                     sep=';', low_memory=False,
                     header=0, index_col=False, parse_dates=[0])

df = pd.DataFrame(data, dtype=None)
# Replace the "?" placeholder with 'NaN' so the conversion to float succeeds
col = (df['Global_active_power']
       .apply(lambda x: x if x != '?' else 'NaN')
       .astype('float64'))
print(col.max())

>>> 11.122
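As a possible alternative to the apply/astype approach, pandas can treat "?" as missing either while parsing or afterwards. The sketch below is not the answer's method; it uses a made-up three-row sample in the same semicolon-separated layout (an assumption for illustration) so that it runs without the real file. `na_values` and `pd.to_numeric(..., errors="coerce")` are standard pandas features.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the real file (illustration only)
sample = io.StringIO(
    "Date;Time;Global_active_power\n"
    "16/12/2006;17:24:00;4.216\n"
    "16/12/2006;17:25:00;?\n"
    "16/12/2006;17:26:00;5.360\n"
)

# Option 1: tell the parser that "?" means missing, so the column is float64
df = pd.read_csv(sample, sep=";", na_values="?")
print(df["Global_active_power"].dtype)
print(df["Global_active_power"].max())

# Option 2: convert after loading; anything unparseable becomes NaN
raw = pd.Series(["4.216", "?", "5.360"])
converted = pd.to_numeric(raw, errors="coerce")
print(converted.max())
```

Either way, max() skips NaN by default, so the result is the largest actual reading.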