Development, Deployment and In Between: UTF-8 Encoding

Sunday, November 6, 2016

UTF-8 Encoding - MySQL and Spark

Let's say your data is in Hebrew or another non-Latin language, and you want to process it in Spark and store some of the results in MySQL. The first step is to set the database and table character set and collation to UTF-8, either at creation time or with ALTER if they already exist:

CREATE DATABASE name DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;

CREATE TABLE table_name (column_name column_type CHARACTER SET utf8 DEFAULT NULL,...)
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

But that is not enough. You also need to set the MySQL JDBC client connection parameters, either by appending the following to the URL:

.... ?useUnicode=true&characterEncoding=UTF-8
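As a minimal sketch of the URL approach, the helper below appends the two parameters to an existing JDBC URL, choosing `?` or `&` depending on whether the URL already has a query string. The object name `JdbcUrl` and the method `withUtf8` are hypothetical names introduced for illustration; only the two parameter names come from the post.

```scala
object JdbcUrl {
  // The two parameters named in the post, kept in one place.
  private val Utf8Params = Seq("useUnicode" -> "true", "characterEncoding" -> "UTF-8")

  /** Appends the UTF-8 parameters to a JDBC URL (hypothetical helper).
    * Uses '?' when the URL has no query string yet, '&' otherwise.
    */
  def withUtf8(url: String): String = {
    val query = Utf8Params.map { case (k, v) => s"$k=$v" }.mkString("&")
    val separator = if (url.contains("?")) "&" else "?"
    url + separator + query
  }
}
```

For example, with a made-up host and database name, `JdbcUrl.withUtf8("jdbc:mysql://localhost:3306/mydb")` produces a URL ending in `?useUnicode=true&characterEncoding=UTF-8`. The sketch does not check whether the parameters are already present in the URL.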


Or by setting the connection properties for the DataFrame we are going to write:
...
connProps.setProperty("characterEncoding", "UTF-8")
connProps.setProperty("useUnicode", "true")
resultsetDf.write.mode(saveMode).jdbc(mysqljdbcurl, tableName, connProps)
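The snippet above leaves out how `connProps` is created. One way to build it, sketched here under assumptions, is a plain `java.util.Properties` object that also carries the credentials. The object name `Utf8Props`, the method `build`, and the `user` and `password` arguments are hypothetical and only serve the example.

```scala
import java.util.Properties

object Utf8Props {
  /** Builds JDBC connection properties with the UTF-8 settings from the post.
    * `user` and `password` are placeholders for your own credentials.
    */
  def build(user: String, password: String): Properties = {
    val connProps = new Properties()
    connProps.setProperty("user", user)
    connProps.setProperty("password", password)
    // The two settings the post relies on for non-Latin text.
    connProps.setProperty("useUnicode", "true")
    connProps.setProperty("characterEncoding", "UTF-8")
    connProps
  }
}
```

The resulting object can be passed as the third argument of the `jdbc(...)` call shown above, in place of a hand-built `connProps`.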


Good luck
