Apache Spark, a Fast Engine for Large-scale Data Processing

Apache Spark is a fast, open source framework for distributed cluster computing and a popular tool for big data analytics. It was designed to be used on a variety of architectures and with multiple programming languages. Spark runs on Windows, Linux, macOS, and other UNIX-like systems, and it can run standalone or as part of a cluster.

Spark provides an optimized in-memory data processing engine that can perform ETL, machine learning, analytics, and graph processing on large volumes of data, along with high-level APIs for Scala, Java, Python, R, and SQL. It also includes a number of higher-level tools: Spark Streaming for streaming analytics, GraphX for graph processing, MLlib for machine learning, and Spark SQL for processing structured data with SQL.

To run on a cluster, Spark needs a cluster manager and a distributed storage system. For cluster management, Spark supports Apache Mesos and Hadoop YARN as well as its own native standalone cluster manager. For distributed storage, Spark integrates with a variety of commercial and open source solutions, such as MapR File System, Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, Kudu, and OpenStack Swift, and it also allows a custom solution to be implemented.
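
The choice of cluster manager usually shows up as the value passed to the `--master` option of `spark-submit`. The sketch below, in plain Python, only assembles that command line and does not launch anything. The helper names `master_url` and `spark_submit_command`, the host name, and the script name are hypothetical; the URL forms (`local[*]`, `spark://HOST:PORT`, `mesos://HOST:PORT`, `yarn`) follow the formats described in the Spark documentation.

```python
from typing import List, Optional


def master_url(manager: str, host: Optional[str] = None,
               port: Optional[int] = None) -> str:
    """Return a --master value for the given cluster manager (illustrative helper)."""
    if manager == "local":
        # Run in a single process, using all available cores.
        return "local[*]"
    if manager == "yarn":
        # The cluster location is taken from the Hadoop configuration.
        return "yarn"
    if manager in ("standalone", "mesos"):
        if host is None or port is None:
            raise ValueError(f"{manager} needs a host and a port")
        scheme = "spark" if manager == "standalone" else "mesos"
        return f"{scheme}://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


def spark_submit_command(app: str, manager: str,
                         host: Optional[str] = None,
                         port: Optional[int] = None) -> List[str]:
    """Build the argument list for spark-submit without executing it."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


if __name__ == "__main__":
    # "master.example.com" and "etl_job.py" are placeholder names.
    print(" ".join(spark_submit_command("etl_job.py", "standalone",
                                        "master.example.com", 7077)))
    print(" ".join(spark_submit_command("etl_job.py", "yarn")))
```

Keeping the master URL out of the application code, as this sketch assumes, lets the same script move between a local run and a cluster without changes.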

Currently, the Spark project stack consists of Spark Core and four libraries, each optimized for a different use case. Applications typically require Spark Core, which is the Spark engine itself, together with one or more of these libraries.

  • Spark Core provides management functions such as task dispatching and scheduling. It is built around the Resilient Distributed Dataset (RDD), a programming abstraction that supports in-memory data storage.
  • MLlib is a scalable machine learning library that provides common statistical and machine learning algorithms.
  • Spark Streaming processes streaming data and can be integrated with established data stream sources such as Kafka and Flume.
  • Spark SQL works with structured data and supports workloads that combine typical SQL database queries with algorithm-based analytics.
  • GraphX provides computation and analysis over graphs of data and includes many widely known algorithms, for example, PageRank.
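
To make the PageRank mention concrete, here is a minimal sketch of the algorithm in plain Python. It does not use GraphX or any other Spark API and runs on a single machine. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the sample graph are all assumptions chosen for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iterative PageRank over an adjacency list (illustrative, not the GraphX API)."""
    # Collect every vertex, including ones that appear only as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution over all vertices.
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every vertex first receives the same "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this vertex's rank evenly across its outgoing edges.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A vertex with no outgoing edges spreads its rank over all vertices.
                share = damping * ranks[node] / n
                for other in nodes:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # A tiny made-up link graph: each key links to the pages in its list.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, rank in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {rank:.3f}")
```

In this sample graph, page `c` has the most incoming links and ends up with the highest rank, while `d`, which nothing links to, keeps only the baseline share. A distributed engine applies the same per-iteration update, but with the vertices and edges partitioned across the cluster.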