Sunday, November 6, 2016

UTF-8 Encoding - MySQL and Spark

Let's say your data is in Hebrew or another non-Latin language, and you want to process it in Spark and store some of the results in MySQL. Cool... so you set the table charset and collation to UTF-8, either at creation time or with ALTER TABLE if the table already exists:

CREATE DATABASE name DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE table_name (column_name column_type CHARACTER SET utf8 DEFAULT NULL,...) 
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

but that's not enough. You will also need to set the MySQL JDBC client connection parameters,
either by appending the following to the connection URL:
....?useUnicode=true&characterEncoding=UTF-8

or by setting the connection properties on the DataFrame we are going to write:

val connProps = new java.util.Properties()
connProps.setProperty("characterEncoding", "UTF-8")
connProps.setProperty("useUnicode", "true")
resultsetDf.write.mode(saveMode).jdbc(mysqljdbcurl, tableName, connProps)
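To confirm the round trip actually worked, a quick sketch is to read the table back through the same JDBC URL and connection properties and eyeball the text (`spark` here is an assumed SparkSession; the other names are the ones from the snippet above):

```scala
// Read the table back with the same UTF-8 connection properties.
// If the encoding is broken anywhere along the way, Hebrew text
// typically comes back as '????' instead of the original characters.
val readBack = spark.read.jdbc(mysqljdbcurl, tableName, connProps)
readBack.show(5, truncate = false)
```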


Good luck



Query that returns a large ResultSet using Hive JDBC takes ages to complete

You are trying to execute a query whose result set is huge. Its execution time through the beeline CLI is fine, but through Hive JDBC it takes ages to complete. The Hive and Spark logs don't show any errors, but you might see lots of Kryo messages in the Hive debug logs - in such cases it's highly recommended to start Hive with debug logging:
hiveserver2 --hiveconf hive.root.logger=DEBUG,console
This usually happens because of the time spent in Kryo serialization/deserialization, in case you have configured Hive on Spark.
In such cases I recommend executing the query through Spark itself, so the end-to-end process is much faster and comparable to the beeline execution time.
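A minimal sketch of what running the query through Spark might look like (this assumes a SparkSession built with Hive support and access to the same metastore; the query string and output path are placeholders, not from the original setup):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: a SparkSession wired to the Hive metastore.
val spark = SparkSession.builder()
  .appName("large-resultset")
  .enableHiveSupport()
  .getOrCreate()

// Run the query inside Spark instead of pulling the huge result set
// through HiveServer2's JDBC path, which is where the Kryo
// serialization/deserialization time is spent.
val df = spark.sql("SELECT ... FROM some_large_table")  // placeholder query

// Persist the results from the executors rather than streaming
// every row back to a single JDBC client.
df.write.mode("overwrite").parquet("/tmp/results")      // placeholder path
```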

Good luck

Thursday, November 3, 2016

Initiating a SparkContext throws javax.servlet.FilterRegistration SecurityException


If you are trying to load hive-jdbc, hadoop-client and jetty together in the same Scala project along with your Spark dependencies, you might not be able to start a standalone Spark application.
While initializing the SparkContext, it will throw a javax.servlet.FilterRegistration SecurityException, because mixed javax.servlet dependencies are imported at different versions from several sources.

How to avoid this conflict?
You will need to add ExclusionRules to some of the dependencies in your build.sbt file:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.4" excludeAll(
  ExclusionRule(organization = "javax.servlet"),
  ExclusionRule(organization = "org.mortbay.jetty")
)

libraryDependencies += "org.apache.hive" % "hive-jdbc" % "1.2.1" excludeAll(
  ExclusionRule(organization = "javax.servlet")
)
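If you need to find out exactly which dependency drags in which javax.servlet artifact, one option (my suggestion, not part of the original setup) is the sbt-dependency-graph plugin, added to project/plugins.sbt:

```scala
// project/plugins.sbt
// Hypothetical plugin version - check the plugin's page for one matching your sbt.
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")
```

Then run the dependencyTree task in sbt and search its output for javax.servlet to see every path that pulls it in, so you know where the ExclusionRules are needed.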

Good luck