When Doug Cutting created the Hadoop framework 10 years ago he never expected it to bring massive-scale computing to the corporate world.
“My expectations were more moderate than what we’ve seen, for sure,” he said, speaking at the Strata and Hadoop World conference.
Today Hadoop is used by many household names, helping Facebook analyse traffic for its more than 1.6 billion monthly users and aiding Visa in uncovering billions of dollars-worth of fraud.
The attraction of Hadoop lies in how it allows big data to be processed more cheaply and, in certain respects, more simply. The platform provides a group of technologies that allow very large datasets to be spread across large clusters of commodity servers and processed in parallel.
Yet there are limitations to what the platform can do. Today, the speed at which Hadoop clusters can process very large datasets is capped by the rate at which data is shuttled between secondary storage — SSDs or even slower spinning discs — and a computer’s memory and CPU.
This I/O bottleneck has arisen because processor speed and efficiency are increasing faster than storage read-write rates.
PUTTING A PETABYTE IN MEMORY
But now storage technology is poised to undergo a significant shift, one which Cutting said will help take the brakes off big data processing.
This year, Intel plans to release its 3D XPoint storage chips, which can retrieve data 1,000 times faster than the NAND flash typically used in SSDs, while also being able to store data at a density ten times greater than DRAM, which is the type of memory typically used today.
While XPoint will initially be offered as storage in the form of Optane-branded SSDs, Intel is planning to follow that up by releasing XPoint memory modules. Thanks to XPoint storing data at far higher densities than traditional DRAM, these modules will allow servers to have a much larger memory than is the norm today. Intel has talked about Intel Xeon servers being available next year with 6TB of memory, made up of a mixture of DDR4 DRAM and XPoint. That said XPoint won’t match DDR4 DRAM for performance. The seven microsecond latency and 78,000 read/write IOPS of pre-release XPoint SSDs is slower than DRAM and no more than 20 times faster than high-performance SSDs by some estimates.
Regardless, Cutting predicts the use of XPoint and other non-volatile memory in Hadoop clusters will open up the platform to new uses, allowing users to process much larger datasets in memory, which in turn will bypass the latency inherent in fetching data from disk.
“If you could have a petabyte of data in memory, accessible from any node within cycles, that’s several layers of magnitude performance improvement, if you’re doing certain kinds of algorithms,” said Cutting, who is now chief architect at Cloudera, which offers its own distribution of Hadoop.
“Things that are very expensive now, like graph operations, various sorts of iterative machine learning algorithms, clustering — things that have traditionally taken a very long time — can now be done very quickly and over pretty impressive amounts of data.
“There are still going to be datasets that are too big, and computations that are too slow but I think it will change things a lot,” he said, adding that latency related to network traffic would also be lessened by remote direct management access and Gigabit Ethernet switching.
Intel invested an estimated $740m in Cloudera in 2014. As part of the partnership between the two companies, Intel keeps Cloudera informed about new features and hardware in the pipeline, helping ensure this technology can be leveraged by Cloudera’s distribution of Hadoop.
“We want to make sure that the tools we provide can take advantage of this,” Cutting said about XPoint.
“We’ve worked very hard to minimise the CPU usage for accessing data structures that are already in memory,” he said, adding Cloudera had attempted to prevent unnecessary operations that would cause the CPU to bottleneck the processing of in-memory data.
HADOOP AND THE CLOUD
Cutting is also keen to make Hadoop available to a wider number of people by making it easier to spin up Hadoop clusters in the cloud.
It is already possible to build a Hadoop cluster on various cloud platforms. For instance, those running Cloudera’s Distribution of Hadoop (CDH), can use Cloudera Director to spin up clusters of virtual servers on Amazon Web Services and Google Cloud Platform.
However, there are still limitations that Cutting says need to be addressed to make the process easier, with Cloudera planning to improve support for feeding data from AWS S3 and other cloud-based storage to Hadoop’s data processing engines.
“There are some adaptations we need to make to Hadoop to make it able to work better in the cloud. We need to treat the storage, things like Amazon’s S3, as a first-class citizen, along with HDFS [Hadoop Distributed File System] for input and output, so that people can spin up clusters dynamically,” he said.
And with clusters more likely to be spun up and down in the cloud, Cutting says Cloudera also want to improve start-up times.
Another problem that Cutting would like to tackle, is making it easier to transfer a Hadoop cluster from one cloud platform to another, with Cutting dismayed at the current state of cloud lock-in.
“We think we could provide some real value in giving people portability between cloud providers. Right now, if you start developing your applications in the cloud you’re very rapidly locked into one cloud vendor.”
With its distribution of Hadoop, Cutting said Cloudera is in the process of building “a layer that lets people decide whether workloads go on-premise, or into Amazon, Google, Microsoft or other cloud provider.”
This functionality is available to some extent via Cloudera Director today, he said, and “it’s something that we’ll continue to push and make that more seamless”.
Looking further into the future of distributed systems, Cutting says an architecture is needed that can instantaneously consult both real-time and historical data to help make real-time decisions.
“There’s various ways of doing that today and they all have flaws. I think we will iron that one out soon.”
Ultimately, he believes Hadoop’s legacy will be to have played a role in making big data the norm, open-source the defacto choice for software and relegating relational databases to a niche.
“We won’t be talking about big data, we’ll be talking about data systems. The open source stack will no longer be the new thing, it will be the incumbent and the way people operate. Relational systems will be the Cobol-equivalent and very much legacy. We’ll have moved a long ways in 10 years.”
This article was originally published on www.zdnet.com and can be viewed in full


Archive
- October 2024(44)
- September 2024(94)
- August 2024(100)
- July 2024(99)
- June 2024(126)
- May 2024(155)
- April 2024(123)
- March 2024(112)
- February 2024(109)
- January 2024(95)
- December 2023(56)
- November 2023(86)
- October 2023(97)
- September 2023(89)
- August 2023(101)
- July 2023(104)
- June 2023(113)
- May 2023(103)
- April 2023(93)
- March 2023(129)
- February 2023(77)
- January 2023(91)
- December 2022(90)
- November 2022(125)
- October 2022(117)
- September 2022(137)
- August 2022(119)
- July 2022(99)
- June 2022(128)
- May 2022(112)
- April 2022(108)
- March 2022(121)
- February 2022(93)
- January 2022(110)
- December 2021(92)
- November 2021(107)
- October 2021(101)
- September 2021(81)
- August 2021(74)
- July 2021(78)
- June 2021(92)
- May 2021(67)
- April 2021(79)
- March 2021(79)
- February 2021(58)
- January 2021(55)
- December 2020(56)
- November 2020(59)
- October 2020(78)
- September 2020(72)
- August 2020(64)
- July 2020(71)
- June 2020(74)
- May 2020(50)
- April 2020(71)
- March 2020(71)
- February 2020(58)
- January 2020(62)
- December 2019(57)
- November 2019(64)
- October 2019(25)
- September 2019(24)
- August 2019(14)
- July 2019(23)
- June 2019(54)
- May 2019(82)
- April 2019(76)
- March 2019(71)
- February 2019(67)
- January 2019(75)
- December 2018(44)
- November 2018(47)
- October 2018(74)
- September 2018(54)
- August 2018(61)
- July 2018(72)
- June 2018(62)
- May 2018(62)
- April 2018(73)
- March 2018(76)
- February 2018(8)
- January 2018(7)
- December 2017(6)
- November 2017(8)
- October 2017(3)
- September 2017(4)
- August 2017(4)
- July 2017(2)
- June 2017(5)
- May 2017(6)
- April 2017(11)
- March 2017(8)
- February 2017(16)
- January 2017(10)
- December 2016(12)
- November 2016(20)
- October 2016(7)
- September 2016(102)
- August 2016(168)
- July 2016(141)
- June 2016(149)
- May 2016(117)
- April 2016(59)
- March 2016(85)
- February 2016(153)
- December 2015(150)