How Zoomdata Uses Apache Spark

Zoomdata leverages Spark in the following ways:

  • As a mechanism for result set caching
  • As a processing engine

Spark can also serve as a data source within the Zoomdata environment: Zoomdata connects to a Spark cluster through the SparkSQL connector. For more information, including the steps to set up a connection, see Managing SparkSQL Connectors.

Zoomdata uses Apache Spark as a processing layer for custom metrics, totals, and pivots on results. Because Zoomdata pushes queries to the original data source, operations such as aggregation, filtering, and custom metrics are performed close to where the data is stored. When the aggregated and filtered result sets are retrieved from the source, they are cached in Spark as data frames, which are built on resilient distributed datasets (RDDs). When you submit a new request for data, Zoomdata retrieves it from the Spark data cache whenever possible.
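
The cache-first behavior described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Zoomdata's implementation: the ResultSetCache class, the fake_source function, and the use of the query string as the cache key are all hypothetical, and an in-memory dictionary stands in for the Spark data cache.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]


class ResultSetCache:
    """Hypothetical stand-in for the Spark data cache (a plain dict here)."""

    def __init__(self, fetch_from_source: Callable[[str], List[Row]]) -> None:
        self._fetch = fetch_from_source
        self._cache: Dict[str, List[Row]] = {}
        self.source_hits = 0  # number of times the source was actually queried

    def get(self, query: str) -> List[Row]:
        # Serve the result set from the cache whenever possible.
        if query in self._cache:
            return self._cache[query]
        # Otherwise push the query to the source and cache what comes back.
        self.source_hits += 1
        rows = self._fetch(query)
        self._cache[query] = rows
        return rows


def fake_source(query: str) -> List[Row]:
    # Assumption: the source has already aggregated and filtered the data.
    return [("east", 120), ("west", 95)]


cache = ResultSetCache(fake_source)
query = "SELECT region, SUM(sales) FROM orders GROUP BY region"
first = cache.get(query)   # goes to the source
second = cache.get(query)  # served from the cache
```

In this sketch the second request never reaches the source, so source_hits stays at 1 after both calls.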

Zoomdata also uses cached result sets when the user sorts or crosstabs the data, or performs another interaction that can be satisfied without going back to the original source. If a data source lacks a capability needed to execute a request, Spark is used to make up the difference.
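
As a rough illustration of interactions that need no new source query, the sketch below sorts and crosstabs a result set that is assumed to be cached already. It is plain Python rather than Spark code, and the cached_rows data and the sort_by_sales and crosstab helpers are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (region, quarter, sales)

# Hypothetical cached result set, already aggregated by the source.
cached_rows: List[Row] = [
    ("west", "Q1", 95),
    ("east", "Q1", 120),
    ("east", "Q2", 130),
    ("west", "Q2", 110),
]


def sort_by_sales(rows: List[Row]) -> List[Row]:
    # Sorting works on the cached rows only; the source is not queried again.
    return sorted(rows, key=lambda row: row[2], reverse=True)


def crosstab(rows: List[Row]) -> Dict[str, Dict[str, int]]:
    # Pivot regions into rows and quarters into columns, summing sales.
    table: Dict[str, Dict[str, int]] = defaultdict(dict)
    for region, quarter, sales in rows:
        table[region][quarter] = table[region].get(quarter, 0) + sales
    return {region: dict(columns) for region, columns in table.items()}


sorted_rows = sort_by_sales(cached_rows)
pivot = crosstab(cached_rows)
```

Both operations read only from cached_rows, which mirrors the idea that a sort or crosstab can be answered from the cache instead of the original source.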

By default, Zoomdata provides an embedded Spark server that uses Spark version 2.2.