
What do Bloomberg, Capital One, and Comcast have in common? If you said they’re operationalizing data science using Apache Spark, then give yourself a gold star. These companies shared their stories of the upstart analytics toolbox during the recent Spark Summit East conference, and as the stories show, Spark is not only helping enterprises achieve their analytic dreams; those enterprises are also accelerating the development of Spark along the way.
If you’re a regular reader of this publication, then you already know that Apache Spark is currently the hottest project in the big data analytics and data science community. The Hadoop distributors have long since jumped on the Spark bandwagon, and even IBM is now singing the praises of the free and open source distributed analytic framework that competes with so many of its proprietary offerings.
But as is the case with any new technology, the proof is whether customers adopt it. According to the presentations made at the Spark Summit East conference held in New York City two weeks ago, the adoption is large, and getting larger by the day.
Petabyte-Scale Analytics
Sridhar Alla, the director of EBI solutions architecture and a big data architect at cable giant Comcast, told the audience that the company has all sorts of big data analytics projects in play, including trying to learn more about customers by analyzing their clickstreams. But one of the more interesting ways it’s using Spark is to detect anomalies in its 30 million cable boxes, which generate more than 1 billion data points every day.
“At petabyte scale, the big difficulty is what to really focus on,” Alla told his audience at Spark Summit East. “You want me to look at power consumption of the amplifier? Fine. What about the radio frequency? Fine. What about color of the box? Maybe it has something to do with the chemicals in the paint that they’re using to paint the box? You never know. That’s the kind of problem you’re faced constantly.”
Comcast is performing anomaly detection for its set-top boxes by running Spark on a 400-node cluster that sports nearly a terabyte of RAM and 8 PB of raw storage. Putting that kind of horsepower behind Spark lets Alla’s large team of data scientists explore the data and come up with solutions to real-world problems.
“It [Spark] is giving us more benefit than a lot of other models,” Alla said. “Not to say that other machine learning algorithms are not good. There are very specific vendors out there in particular areas that do a better job at specific algorithms…But in general we found it was very good here because we had a lot of data coming in real time versus batch. And…we do have data at petabyte scale.”
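The article does not say which anomaly detection method Comcast uses. Purely as an illustration of the general idea, here is a minimal sketch in plain Python (not Spark) that flags readings far from the mean using a z-score. The function name, the sample readings, and the 2.5 threshold are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(readings, threshold=2.5):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    Illustrative only: a simple z-score rule, not Comcast's actual method.
    """
    mu = mean(readings)
    sigma = stdev(readings)  # sample standard deviation
    if sigma == 0:
        # All readings are identical, so nothing stands out.
        return []
    return [
        (i, x)
        for i, x in enumerate(readings)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical power readings (watts) from a single set-top box.
power_watts = [12.1, 11.9, 12.0, 12.2, 11.8, 12.0, 30.5, 12.1, 11.9, 12.0]
print(find_anomalies(power_watts))
```

A rule this simple only works on one metric at a time, which hints at the problem Alla describes: at petabyte scale, deciding which of many possible signals to examine is the hard part.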
Analytics of a Higher Order
Spark also plays a role in the development of Bloomberg’s low-latency, cloud-based analytics platform, which is used to serve financial data to the company’s clients. By using a Spark concept called a DataFrame, the company is able to build higher order analytics, in which fresh calculations are based on older calculations, ad infinitum.
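To make "higher order analytics" concrete, here is a minimal sketch in plain Python (deliberately not the Spark DataFrame API, and not Bloomberg's code) in which a second calculation is built on the output of a first one. The function names and the sample prices are hypothetical.

```python
def daily_returns(prices):
    """First-order calculation: fractional change between consecutive prices."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical closing prices.
prices = [100.0, 110.0, 99.0, 108.9]

returns = daily_returns(prices)               # computed from raw data
smoothed = moving_average(returns, window=2)  # computed from the earlier result

print(returns)
print(smoothed)
```

The second result never touches the raw prices; it depends only on the first result, and a third calculation could be layered on `smoothed` in the same way.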
Spark is important because it eliminated a trade-off that previously bedeviled developers of analytic apps, Bloomberg architect Partha Nageswaran says. “Traditionally this problem has been solved by trading off latency for on-the-fly” calculations, Nageswaran says. “But with the advent of technology like Spark we can actually talk about these two concepts within the same sentence and make them happen.”
However, the rapid adoption of Spark also let some old problems creep back in. According to Nageswaran, Spark clusters started proliferating at Bloomberg about 1.5 years ago. “Almost always the standard mode of operation was to take a standard infrastructure, put a Spark cluster on top of it, and craft Spark applications,” he said. “This was great to get off the block. But you can see immediately the synthetic boundaries that are [created] by the various silos essentially make it harder, if not impossible, to do higher order compositional analytics on top of the output of each of the Spark applications.”
This led to the concept of “serverizing” Spark, which was the topic of the presentation by Nageswaran and fellow Bloomberg worker Sudarshan Kadambi. According to Nageswaran, by centralizing the Spark cluster and keeping the insights within a single system, the company can eliminate the impediments to sharing caused by separate Spark clusters. The company developed a concept called Managed DataFrames to help it do so.
Spark Gone Viral
Bloomberg and Comcast are at the beginning of the wave of Spark adoption, but don’t be surprised if it continues. “This thing has gone so viral,” said Ovum analyst Tony Baer.
Baer is old enough to remember when IBM put $1 billion behind an open source operating system at a time when no self-respecting company would be caught dead running critical business systems on free software you could download from the Web. Eventually Linux became the dominant operating system, and Baer now sees a similar dynamic at play with Spark.
“The significance of the Spark open source project will have the same degree of impact on the enterprise as Linux did about 15 years ago,” he says. There are good reasons for that, Baer adds, citing Spark’s ease of programming and its performance compared to MapReduce.
API consistency is another big reason why Spark is becoming widely adopted, not only by data scientists but also by software vendors, Baer says. That is creating new opportunities to build big data apps, particularly around machine learning.
“It wasn’t that you couldn’t do machine learning before Spark,” Baer says. “But if you had to do it through MapReduce, it might have taken hours,” compared to minutes or seconds with Spark. “So it makes a huge difference not just in your solution time but the quantity of results you can get.”
A Single Toolbox for Data Science
Forrester analyst Mike Gualtieri agrees that Spark is the real deal. “I’ve been looking at this big data phenomenon for a few years now and Spark is quite remarkable in what we’ve seen in the last year,” he said during his presentation at Spark Summit East.
The biggest driver of Spark adoption by the enterprise is the shift to advanced analytics. According to Forrester figures, the percentage of enterprises reporting that they’re doing some form of advanced analytics jumped from 31 percent in 2014 to 48 percent in 2015. “There’s a lot of at least reported momentum for enterprise in using these various forms of advanced analytics,” Gualtieri says. “Spark is one of the solutions that enterprise are going to choose to perform some of these analytics.”
In her keynote, Anjul Bhambhri, the VP of product development for IBM’s big data and analytics platform, says Spark is fast becoming the operating system for analytics, just as Linux became the dominant operating system for Web applications.
“Never before has such a rich set of analytical foundational capabilities come together in one platform in one stack,” she says. “Spark is really the single toolbox for analytics. If you have structured data, you use Spark SQL. For semi-structured data, Spark core. If you have data from a firehose, Spark Streaming. For building models, you have MLlib. For learning from graphs of data, there’s GraphX.
“The beauty of this is all of these components work together in a seamless manner,” she continues. “In the past if you needed these kinds of capabilities you needed half a dozen products…Today you need just one foundational platform, which is Apache Spark.”
This article was originally published on datanami.com.

