Sunday, November 6, 2016

UTF-8 Encoding - MySQL and Spark

Let's say your data is in Hebrew or another non-Latin language, and you want to process it in Spark and store some of the results in MySQL. Cool... so you set the database and table character set and collation to UTF-8, either at creation time or with ALTER if they already exist:

CREATE DATABASE name DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE table_name (column_name column_type CHARACTER SET utf8 DEFAULT NULL,...) 
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

But that's not enough. You also need to set the encoding parameters on the MySQL JDBC client connection,
either by appending the following to the JDBC URL:
.... ?useUnicode=true&characterEncoding=UTF-8
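If you build the URL in code, a small helper can add the two parameters for you. This is only a sketch: the object name `Utf8JdbcUrl`, the method `withUtf8`, and the host and database in the usage line are made up for illustration. It picks `?` or `&` depending on whether the URL already has a query string.

```scala
object Utf8JdbcUrl {
  // Appends the two encoding parameters to a MySQL JDBC URL.
  // Uses '?' when the URL has no query string yet, '&' otherwise.
  def withUtf8(url: String): String = {
    val separator = if (url.contains("?")) "&" else "?"
    s"$url${separator}useUnicode=true&characterEncoding=UTF-8"
  }
}
```

For example, `Utf8JdbcUrl.withUtf8("jdbc:mysql://localhost:3306/mydb")` gives `jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8`.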

Or by setting them as connection properties for the DataFrame we are going to write:
...
connProps.setProperty("characterEncoding", "UTF-8")
connProps.setProperty("useUnicode", "true")
resultsetDf.write.mode(saveMode).jdbc(mysqljdbcurl, tableName, connProps)
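To see the properties approach end to end, here is a fuller sketch. Everything specific in it is an assumption for illustration: the local MySQL URL, the `words` table, the `spark_user` / `secret` credentials, the tiny sample DataFrame, and the helper `utf8ConnectionProperties`. It also assumes the MySQL Connector/J driver is on the classpath and that the source file itself is saved as UTF-8.

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object Utf8WriteExample {
  // Builds JDBC connection properties with the two encoding settings.
  def utf8ConnectionProperties(user: String, password: String): Properties = {
    val connProps = new Properties()
    connProps.setProperty("user", user)
    connProps.setProperty("password", password)
    connProps.setProperty("useUnicode", "true")
    connProps.setProperty("characterEncoding", "UTF-8")
    connProps
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("utf8-jdbc-write")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small DataFrame with Hebrew text, standing in for the real results.
    val resultsetDf = Seq((1, "שלום"), (2, "עולם")).toDF("id", "word")

    // Placeholder URL, table name, and credentials.
    val mysqljdbcurl = "jdbc:mysql://localhost:3306/mydb"
    val connProps = utf8ConnectionProperties("spark_user", "secret")

    resultsetDf.write.mode(SaveMode.Append).jdbc(mysqljdbcurl, "words", connProps)

    spark.stop()
  }
}
```

Keeping the properties in one helper means every write in the job picks up the same encoding settings.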


Good luck


