5 Reasons Apache Spark is the Swiss Army Knife of Big Data Analytics
We are living in exponential times, especially when it comes to data. The world is moving fast, and more data is generated every day. With all that data coming your way, you need the right tools to deal with it.
If you want to get any insights from all that data, you need tools that can process massive amounts of it quickly and efficiently. Fortunately, the Big Data open source landscape is also growing rapidly, and more and more tools are coming to market to help you with this. One of the open source tools gaining fame at the moment is Apache Spark.
This week I was invited to join the IBM Spark analyst session and the Apache™ Spark Community Event in San Francisco, where the latest news on Apache Spark was shared.
IBM announced its continued commitment to the Apache Spark community. At the core of this commitment, IBM wants to offer Spark as a Service on the IBM Cloud as well as integrate Spark into all of its analytics platforms. It has also donated the IBM SystemML machine learning technology to the Spark open source ecosystem, allowing the Spark community to benefit from it. In addition, IBM announced the Spark Technology Centre and the objective to educate over a million data scientists and engineers on Spark.
Apache Spark is currently very hot, and many organisations are using it to do more with their data. It can be seen as the Analytics Operating System, and it has the potential to disrupt the Big Data ecosystem. It is an open source tool with over 400 developers contributing to it. Let me explain what Apache Spark is and show you how it could benefit your organisation:
Apache Spark Enables Deep Intelligence Everywhere
Spark is an open source tool that was developed in the AMPLab at UC Berkeley. It is a general-purpose engine for large-scale data processing that scales to thousands of nodes. It is an in-memory distributed computing engine that is versatile enough to run in many different environments. This enables users and developers to build models quickly, iterate faster and apply deep intelligence to data across their organisation.
Spark’s distinguishing feature is its Resilient Distributed Datasets (RDDs). An RDD is a collection of objects stored in memory or on disk across a cluster, and it is rebuilt automatically on failure. Spark’s in-memory primitives can deliver up to 100 times faster performance than the two-stage, disk-based MapReduce paradigm. It therefore addresses several of the challenges of MapReduce.
Spark lets data scientists and developers work together on a unified platform. It enables developers to execute Python or Scala code across a cluster instead of on one machine. Users can load data into a cluster’s memory and query it repeatedly. This makes Spark an advanced analytics tool that is very useful for machine learning algorithms, which typically pass over the same data many times.
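To make the idea of "rebuilding on failure" more concrete, here is a minimal sketch in plain Python. It is a toy model and not the Spark API: the class `ToyRDD` and its methods `from_list`, `lose_partition` and `collect` are hypothetical names invented for this illustration. The assumption it illustrates is that each dataset remembers the recipe (its lineage) used to compute every partition, so a lost partition can be recomputed from its parent instead of being restored from a replica.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: cached partitions plus a recipe to rebuild them."""

    def __init__(self, num_partitions: int, compute: Callable[[int], List[Any]]):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: partition index -> records
        self._cache: List[Optional[List[Any]]] = [None] * num_partitions

    @classmethod
    def from_list(cls, data: List[Any], num_partitions: int) -> "ToyRDD":
        # Spread the records round-robin over the partitions.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # The child only stores how to derive its partitions from the parent.
        parent = self
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i: int) -> List[Any]:
        # Serve from memory if present, otherwise recompute from lineage.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that wipes one in-memory partition.
        self._cache[i] = None

    def collect(self) -> List[Any]:
        out: List[Any] = []
        for i in range(self.num_partitions):
            out.extend(self.partition(i))
        return out


numbers = ToyRDD.from_list(list(range(10)), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # "node failure"
after = squares.collect()   # partition 1 is rebuilt from its parent
print(before == after)
```

Only the lost partition is recomputed; the other partitions are still served from memory, which is the property the paragraph above describes.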
Spark is very well suited for the Big Data era, as it supports the rapid development of Big Data applications. Code can easily be reused across batch, streaming and interactive applications.
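The reuse of code across batch and streaming can be sketched in plain Python. This is an illustration of the principle only, not Spark code: the function names `count_errors_by_host` and `micro_batches`, the log format and the sample data are all assumptions made up for this example. The same counting logic is applied once to a complete data set and once to a stream that is cut into small batches, which is also how Spark Streaming treats a stream.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_errors_by_host(lines: Iterable[str]) -> Counter:
    """Shared logic: count ERROR lines per host.

    Assumes each line looks like '<host> <level> <message>'.
    """
    counts: Counter = Counter()
    for line in lines:
        host, level, _message = line.split(" ", 2)
        if level == "ERROR":
            counts[host] += 1
    return counts


def micro_batches(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Cut a stream into consecutive batches of at most `size` records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical sample log, used for both modes.
LOG = [
    "web1 INFO started",
    "web1 ERROR timeout",
    "web2 ERROR disk full",
    "web1 ERROR timeout",
    "web2 INFO ok",
]

# Batch mode: process everything at once.
batch_result = count_errors_by_host(LOG)

# Streaming mode: the same function, applied batch by batch.
running_total: Counter = Counter()
for batch in micro_batches(LOG, size=2):
    running_total.update(count_errors_by_host(batch))

print(running_total == batch_result)
```

Because the business logic lives in one function, only the plumbing around it differs between the two modes.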
In a CrowdChat hosted by IBM the week before the analyst session, several important features of Spark were discussed, including:
- Real-time querying of your data;
- High-speed, low-latency stream processing;
- Clear separation of importing data and distributed computation;
- Spark’s libraries, including Spark SQL, Spark Streaming, the MLlib machine learning library, and GraphX;
- A large community and support for Spark from major vendors including IBM, MapR, Cloudera, Intel, Hortonworks and many other Big Data platforms.
5 Use Cases of Apache Spark
According to Vaibhav Nivargi, early adopters of Spark by sector include consumer-packaged goods (CPG), insurance, media and entertainment, pharmaceuticals, retail and automotive.
Spark is also well suited to use cases where high-velocity, high-volume, constant streams of (un)structured data are generated, most of which will be machine data. These use cases include fraud detection, log processing as part of IT Operations Analytics, sensor data processing and of course data related to the Internet of Things.
Experts believe that Spark is likely to evolve into the preferred tool for high-performance Internet of Things applications that will eventually generate petabytes of data across multiple channels. Apache Spark is already used by multiple companies. Some examples of these implementations are the following:
1. ClearStory Data
ClearStory Data uses Apache Spark as the basis for its data harmonisation. Spark enabled the company to create a visualisation tool that allows users to slice and dice massive amounts of data while the visualisations adjust instantly. This allows its users to collaborate on the data without any delay.
2. Conviva
Another use case of Apache Spark, and of Spark Streaming in particular, is at Conviva. Conviva is one of the largest video companies in the world, processing more than 4 billion video streams per month. To achieve this, it dynamically selects and optimises sources to deliver the highest playback quality.
Spark Streaming has enabled Conviva to learn the different network conditions in real time and feed this directly into the video player. This allows it to optimise the streams and ensure that all four billion videos receive the correct amount of buffering.
3. Yahoo
Yahoo has been using Apache Spark for some time and has several projects running on it. One of them uses Spark to offer the right content to the right visitor, i.e. personalisation. Machine learning algorithms determine individual visitors’ interests to give them the right news when they visit Yahoo. The same algorithms also help to categorise news stories as they arise.
Personalisation is in fact very difficult, as it requires high-speed, (near) real-time processing power to build a profile of a visitor the moment they enter your website. Apache Spark helps in this process.
4. Huawei
The telecommunications division of Huawei uses the machine learning libraries of Spark to analyse massive amounts of network data. This network data can show traffic patterns that reveal which subpaths or routers within the network are congested and slow down the traffic. Once the hotspots in the network have been identified, Huawei can add hardware to solve the congestion.
5. RedRock
RedRock was developed during Spark Hacker Days, a hackathon organised among IBM employees. Within ten days they created 100 innovations with Spark. One of these innovations is RedRock, which is currently in private alpha.
RedRock is a Twitter analysis tool that allows the user to search and analyse Twitter to get insights such as categories, topics, sentiment and geography. It lets the user act on data-driven insights discovered from Twitter.
It was developed following the IBM design thinking framework, a cyclical process of understanding, exploring, prototyping and evaluating. RedRock uses a 19-node Spark cluster holding 3.3 terabytes of Twitter data in memory. Thanks to Spark, RedRock can analyse and visualise this amount of data within seconds for any query.
Five Ways Spark Improves Your Business
Apache Spark already has a lot of traction, and with more companies partnering up and using Spark, it is likely to evolve into something really big. This is quite understandable if you look at the different advantages Spark can offer your business:
1. Spark is the right tool for analytic challenges that demand low-latency in-memory machine learning and graph analytics. This is especially relevant for companies that focus on the Internet of Things.
Since we will see the Internet of Things popping up in every imaginable industry in the coming years, Spark will enable organisations to analyse all the data coming from IoT sensors, as it can easily deal with continuous streams of data at low latency. This will enable organisations to create real-time dashboards to explore their data and to monitor and optimise their business.
2. Spark will drastically improve data scientist productivity. It enables faster, iterative product development using popular programming languages. The high-level libraries of Spark, which cover streaming data, machine learning, SQL queries and graph processing, can easily be combined to create complex workflows.
This enables data scientists to create new Big Data applications faster. In fact, it requires two to five times less code. This results in a reduced time-to-market for new products as well as faster access to insights within your organisation.
Spark also enables data scientists to prototype solutions without the requirement to submit code to the cluster every time, leading to better feedback and iterative development.
3. According to James Kobielus, the Internet of Things “may spell the end of data centres as we’ve traditionally known them”. In the coming years, we will see that most of the core functions of a data centre, namely storage and processing, can be decentralised, i.e. performed directly on the IoT devices instead of in a centralised data centre. This is called fog computing and was recently described by Ahmed Banafa on Datafloq.
Fog computing will give organisations unprecedented computing power at relatively low costs. Apache Spark is very well suited for the analysis of massive amounts of highly distributed data. It could therefore be the catalyst required to have fog computing take off and to prepare organisations for the Internet of Things. This, in turn, could enable organisations to create new products and applications for their customers, creating new business models and revenue streams.
4. Spark can run on top of the Hadoop Distributed File System (HDFS). This is a major advantage for organisations that are already familiar with Hadoop and have a Hadoop cluster. Spark works well with Hadoop, using the same data formats and respecting data locality for efficient processing. It can be deployed on existing Hadoop clusters or run side by side with them. This allows organisations that already work with Hadoop to upgrade their Big Data environment easily.
5. There is a large community around Spark, with over 400 developers from around the world contributing and a long list of vendors supporting it. Combined with the compatibility with popular programming languages, this offers organisations a large pool of developers that can work with Spark. Organisations do not have to hire expensive programmers for an unfamiliar tool or programming language, because Spark is easy to use thanks to its libraries and its compatibility with Java, Python and Scala.
The Swiss Army Knife of Big Data Analytics
Spark is a powerful open-source cluster-computing framework for data analytics. It has become very popular because of its speed, its support for iterative computing and the better data access that its in-memory caching provides. Its libraries enable developers to create complex applications faster and better, enabling organisations to do more with their data.
Because of its wide range of applications and its ease of use, Spark is also called the Swiss Army knife of Big Data analytics. With the buzz already happening around Spark, the large community supporting it and the multiple use cases we have already seen, Spark could evolve into the next big thing within the Big Data ecosystem.
Image: Johan Swanepoel/Shutterstock