



Thursday, September 24
9:50 AM - 10:35 AM


Breakthrough OLAP performance on Cassandra and Spark
Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.

First, the existing Cassandra Spark connector makes it easy to load data from Cassandra into Spark. We'll cover how to accelerate queries through different caching options in Spark, along with the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into columnar storage layouts and efficient encoding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format. Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark's Catalyst execution planner, and vectorization for fast cache- and GPU-friendly calculations (see Spark's Project Tungsten).
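
To make the columnar idea concrete, here is a minimal sketch in plain Python of pivoting rows into per-column arrays and dictionary-encoding a low-cardinality string column. It is an illustration only, not FiloDB's or Spark's actual storage format; the sample data and the names `to_columnar` and `avg_temp_by_city` are hypothetical.

```python
from array import array

# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"city": "SF", "device": "a1", "temp": 18.5},
    {"city": "NYC", "device": "b2", "temp": 25.0},
    {"city": "SF", "device": "c3", "temp": 19.5},
    {"city": "NYC", "device": "d4", "temp": 27.0},
]


def to_columnar(rows):
    """Pivot rows into one contiguous sequence per column."""
    # Dictionary-encode the string column: store each distinct value once
    # and keep a small integer code per row.
    dictionary = []
    code_of = {}
    codes = array("i")
    for r in rows:
        city = r["city"]
        if city not in code_of:
            code_of[city] = len(dictionary)
            dictionary.append(city)
        codes.append(code_of[city])
    return {
        "city_dict": dictionary,
        "city_codes": codes,
        "device": [r["device"] for r in rows],
        "temp": array("d", (r["temp"] for r in rows)),
    }


def avg_temp_by_city(cols):
    """Aggregate while reading only the two columns the query needs."""
    n = len(cols["city_dict"])
    sums = [0.0] * n
    counts = [0] * n
    for code, temp in zip(cols["city_codes"], cols["temp"]):
        sums[code] += temp
        counts[code] += 1
    return {cols["city_dict"][i]: sums[i] / counts[i] for i in range(n)}


cols = to_columnar(rows)
result = avg_temp_by_city(cols)
```

The aggregation never touches the `device` column, which is the property that lets a columnar store skip I/O for columns a query does not project.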

FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as:
* Easy exactly-once ingestion from Kafka for streaming and IoT applications
* Incremental computed columns and geospatial annotations. We'll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.

Evan Chan - TupleJump
Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project, a DataStax Cassandra MVP, and co-creator and maintainer of the open-source Spark Job Server. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.

