Home > database >  Pandas how to count when string values converted to_numeric is greater than N?
Pandas how to count when string values converted to_numeric is greater than N?

Time:05-25

I have monthly dataframe (df) that is already in min - max ranges like the below:

Wind      Jan       Feb           Nov         Dec      calib
West     0.1-25.5   2.8-65.3    1.3-61.3    0.9-35.3     50
North    0.2-28.3   3.1-66.4    1.0-67.7    1.9-40.1     60
South    0.3-29.5   2.5-49.4    1.9-63.4    0.3-33.0     60
East     20.5       1.1-41.1    0.9-40.3      nan        50

I want to know the number of times the max wind speed was below the calib number each month. So I am trying to create a column Speed below calib (sbc) like below.

month_col = ['Jan', 'Feb', 'Nov', 'Dec']
df['sbc'] = (pd.to_numeric(df[month_col].str.extract(r"(?<=-)(\d \.\d )")) < df["calib"]).sum(axis=1)

The above code is not working and I am getting the error AttributeError: 'DataFrame' object has no attribute 'str'. How would I fix this?

CodePudding user response:

You can use melt:

sbc = (df.melt(['Wind', 'calib'], var_name='month')
         .assign(value=lambda x: x['value'].str.split('-').str[1].astype(float))
         .query('value < calib').value_counts('Wind'))
df['sbc'] = df['Wind'].map(sbc)

Output:

>>> df
    Wind       Jan       Feb       Nov       Dec  calib  sbc
0   West  0.1-25.5  2.8-65.3  1.3-61.3  0.9-35.3     50    2
1  North  0.2-28.3  3.1-66.4  1.0-67.7  1.9-40.1     60    2
2  South  0.3-29.5  2.5-49.4  1.9-63.4  0.3-33.0     60    3
3   East      20.5  1.1-41.1  0.9-40.3       NaN     50    2

Step by step:

  1. Reshape your dataframe
>>> out = df.melt(['Wind', 'calib'], var_name='month')
     Wind  calib month     value
0    West     50   Jan  0.1-25.5
1   North     60   Jan  0.2-28.3
2   South     60   Jan  0.3-29.5
3    East     50   Jan      20.5
4    West     50   Feb  2.8-65.3
5   North     60   Feb  3.1-66.4
6   South     60   Feb  2.5-49.4
7    East     50   Feb  1.1-41.1
8    West     50   Nov  1.3-61.3
9   North     60   Nov  1.0-67.7
10  South     60   Nov  1.9-63.4
11   East     50   Nov  0.9-40.3
12   West     50   Dec  0.9-35.3
13  North     60   Dec  1.9-40.1
14  South     60   Dec  0.3-33.0
15   East     50   Dec       NaN
  1. Extract max wind from range
>>> out = out.assign(value=lambda x: x['value'].str.split('-').str[1].astype(float))
     Wind  calib month  value
0    West     50   Jan   25.5
1   North     60   Jan   28.3
2   South     60   Jan   29.5
3    East     50   Jan    NaN
4    West     50   Feb   65.3
5   North     60   Feb   66.4
6   South     60   Feb   49.4
7    East     50   Feb   41.1
8    West     50   Nov   61.3
9   North     60   Nov   67.7
10  South     60   Nov   63.4
11   East     50   Nov   40.3
12   West     50   Dec   35.3
13  North     60   Dec   40.1
14  South     60   Dec   33.0
15   East     50   Dec    NaN
  1. Filter out rows and count
>>> out = out.query('value < calib').value_counts('Wind')
Wind
South    3
East     2
North    2
West     2
dtype: int64

Finally map (merge) this series to your original dataframe.

CodePudding user response:

.str is only available for Series, not DataFrame, you can stack/unstack:

month_numeric = (df[month_col].stack()
                   .str.extract(r"(?<=-)(\d \.\d )", expand=False)
                   .astype(float).unstack()
                 )
  • Related