I am struggling with the following error: TypeError: can only concatenate str (not "int") to str. It occurs when I'm trying to calculate correlation in steps 6 and 7.
Although I am aware that an int cannot be concatenated to a str, I have no clue how this relates to my code (I guess something is wrong with the data types). Python and pandas are still fairly new to me. I would be grateful for any clues, because at this point I am really out of ideas.
import pandas as pd
#-1----------------------------------------------------------------------------
#Load the file
input_file = 'GBPUSD_H4.csv'
data = pd.read_csv(input_file).head(2500)
#-2----------------------------------------------------------------------------
#Delete columns SMA14IND and SMA50IND.
data.drop(['SMA14IND','SMA50IND'], inplace=True, axis=1)
#-3----------------------------------------------------------------------------
#I've noticed there are some values in SMA14 and SMA50 like nan5, nan6, nan7.
#Thought that might be the source of errors. Not sure how to get rid of them
#in the most efficient manner so I simply changed them to 0s, as below:
for el in data['SMA14']:
    if (type(el) != int) and (type(el) != float) and (el[0:3] == 'nan'):
        index = data[data['SMA14'] == el].index.item()
        data.at[index, 'SMA14'] = 0
for el in data['SMA50']:
    if (type(el) != int) and (type(el) != float) and (el[0:3] == 'nan'):
        index = data[data['SMA50'] == el].index.item()
        data.at[index, 'SMA50'] = 0
#-4----------------------------------------------------------------------------
#Interpolate the following columns:
data['Close'].interpolate(inplace=True)
data['SMA14'].interpolate(inplace=True)
data['SMA50'].interpolate(inplace=True)
#-5----------------------------------------------------------------------------
#Change nans to 0s.
for col in ['Bulls', 'CCI', 'DM', 'OSMA', 'RSI', 'Stoch', 'Decision']:
    if data[col].isna().sum() > 0:
        data[col].fillna(0, inplace=True)
#-6----------------------------------------------------------------------------
#Find correlation between SMA14 and SMA50.
corr_SMA14_SMA50 = round(data['SMA14'].corr(data['SMA50']),2)
print('Corr for SMA14 and SMA50:', corr_SMA14_SMA50)
#-7----------------------------------------------------------------------------
#Find correlation between Close and SMA14.
corr_Close_SMA14 = round(data['Close'].corr(data['SMA14']),2)
print('Corr for Close and SMA14:', corr_Close_SMA14)
#Find correlation between Close and SMA50.
corr_Close_SMA50 = round(data['Close'].corr(data['SMA50']),2)
print('Corr for Close and SMA50:', corr_Close_SMA50)
Dataset is available here: https://drive.google.com/file/d/1Xruk__mpPx8AknR6lvPlpMVRv-mgg3H-/view?usp=sharing
CodePudding user response:
With a little help I was able to solve this, so I am posting an answer. Maybe someone will find it useful.
The problem was indeed related to data types. It was caused by non-numeric values in columns SMA14 and SMA50 (values like 'nan5', 'nan6', etc., not just 'nan'). data.info() showed that the dtype of those columns was object, whereas they needed to be float64.
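For anyone who wants to check their own data for the same issue, here is a minimal sketch of how the offending values could be located. The small DataFrame is a made-up stand-in for the real CSV (its numbers are invented for illustration), and pd.to_numeric with errors='coerce' is just one possible way to spot values that cannot be parsed as numbers.

```python
import pandas as pd

# Hypothetical miniature stand-in for GBPUSD_H4.csv; the values are invented.
data = pd.DataFrame({
    'Close': [1.30, 1.31, 1.32, 1.33],
    'SMA14': ['nan5', '1.305', 'nan6', '1.315'],
})

# A column holding strings is not reported as a numeric dtype.
print(data.dtypes)

# Try to parse every value; anything unparseable becomes NaN.
as_numbers = pd.to_numeric(data['SMA14'], errors='coerce')

# Values that were present but failed to parse are the culprits.
bad_values = data.loc[as_numbers.isna() & data['SMA14'].notna(), 'SMA14']
print(bad_values.tolist())
```

The same check can be repeated for SMA50 or any other column that is expected to be numeric.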
So here is what needed to be done.
1). Fix the values ('nan5', 'nan6' etc)
to_fix = ('SMA14', 'SMA50')
for col in to_fix:
    for el in data[col]:
        if (type(el) != int) and (type(el) != float) and (el[0:3] == 'nan'):
            index = data[data[col] == el].index.item()
            data.at[index, col] = 0
I know there are other ways of doing it, but I ran into trouble with them, and ultimately the above is what worked for me. Of course, it might be that I was doing something wrong.
2). Change the datatypes - and this is what I was missing when I asked the question.
data['SMA14'] = data['SMA14'].astype(float)
data['SMA50'] = data['SMA50'].astype(float)
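As a possible alternative to the two steps above, both can probably be combined with pd.to_numeric. This is only a sketch on a made-up DataFrame (the values are invented, not taken from the real file), and it makes a different choice from the loop above: the 'nan5'-style values become real NaN instead of 0, so a later interpolate() fills them in. If 0 is what you want, call fillna(0) instead of interpolate().

```python
import pandas as pd

# Hypothetical miniature stand-in for the real data; the values are invented.
data = pd.DataFrame({
    'Close': [1.0, 2.0, 3.0, 4.0],
    'SMA14': ['1.0', 'nan5', '3.0', '4.0'],
    'SMA50': ['2.0', '4.0', 'nan6', '8.0'],
})

for col in ('SMA14', 'SMA50'):
    # Unparseable strings such as 'nan5' become NaN; the result is float64.
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Fill the gaps by linear interpolation (assumption: this is the desired
# treatment; use .fillna(0) instead to mimic the replace-with-0 loop).
data[['SMA14', 'SMA50']] = data[['SMA14', 'SMA50']].interpolate()

corr_SMA14_SMA50 = round(data['SMA14'].corr(data['SMA50']), 2)
print('Corr for SMA14 and SMA50:', corr_SMA14_SMA50)
```

Once the columns are float64, corr() works the same way as in steps 6 and 7 of the question.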