I am trying to load data from s3 bucket to spectrum table but data is not showing correct. Could you please help me to resolve the below issue.
Example:
website
is one column where data is - www.calidadhh.com/\
but while loading into table it is taking other column value - www.calidadhh.com/Home Health Agency Corporation
I am running below query -
drop table schema.dhs_account_ovrvw;
create external table schema.dhs_account_ovrvw(
hospital_id INTEGER
,hospital_name VARCHAR(255)
,hq_address VARCHAR(255)
,hq_address1 VARCHAR(255)
,hq_city VARCHAR(255)
,hq_state VARCHAR(50)
,hq_zip_code VARCHAR(50)
,hq_county VARCHAR(255)
,website VARCHAR(20000)
,firm_type VARCHAR(510)
,rec_add_user_nm VARCHAR(20)
,rec_add_tms TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ('quoteChar' = '\"','escapeChar' = '\\','separatorChar' = '\001')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://*******/Refined_Layer/DHS_ACC_OVRVW/'
TABLE PROPERTIES ('skip.header.line.count' = '1');
Output it is generating -
hospital_id,hospital_name,hq_address,hq_address1,hq_city,hq_state,hq_zip_code,hq_county,website,firm_type,rec_add_user_nm,rec_add_tms
929719 Calidad Home Health & Hospice 1600 E Expy 83 La Feria TX 78559 TX - Cameron www.calidadhh.com/Home Health Agency Corporation CDP_USER 2021-11-22 07:39:57
Output should be -
hospital_id,hospital_name,hq_address,hq_address1,hq_city,hq_state,hq_zip_code,hq_county,website,firm_type,rec_add_user_nm,rec_add_tms
929719 Calidad Home Health & Hospice 1600 E Expy 83 La Feria TX 78559 TX - Cameron www.calidadhh.com/\ Home Health Agency Corporation CDP_USER 2021-11-22 07:39:57
CodePudding user response:
The problem looks to be that '' is being seen as the escape character (serde default). It is not clear what character is following the '' in your desired output (looks like a space) so it is confusing. If you change the escape character to something that is not in your data these characters ('\ ') should come through uninterpreted. I think adding
ESCAPECHAR='^'
to the SERDEPROPERTIES will get you closer.