
Cloudera, the global provider of the fastest, easiest, and most secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, today announced new advancements to further Hadoop as a mainstream platform for data science. Building on recent announcements around Apache Spark and Python that better enable data engineering and data science workloads across big data, Cloudera and Continuum Analytics are making it easier to work with the Python ecosystem through seamless integration of the Anaconda platform with Hadoop. In addition, Cloudera, together with the open source community, announced Apache Arrow, a new open source in-memory columnar data format, to support interoperability and improved performance of Python in the Hadoop ecosystem. These efforts will help data scientists to better take advantage of Hadoop using their preferred skills and tools, and lay the foundation for native data interchange and efficient performance for data engineering and machine learning workloads.
Improving the Python Experience for Data Scientists on Hadoop
Python is the language of choice for data scientists and data engineers due to its power, elegance, and robust libraries and third-party integrations for expressing complex workflows. With frameworks like Apache Spark supporting Python, and new emerging tools like Ibis that better support Python natively for big data, Python has become an increasingly popular choice for data engineering and advanced analytics on Hadoop.
To make it easier for data scientists to get started with Python, Cloudera has partnered with Continuum Analytics – the creator and driving force behind Anaconda, a leading open source Python platform. The jointly developed Anaconda for Cloudera packaging provides a simple, fast experience for customers installing Python, including popular packages such as NumPy, pandas, and scikit-learn, on a Hadoop cluster. Users can deploy Anaconda seamlessly through Cloudera Manager and easily build and run Python-based solutions across Cloudera Enterprise, including under Spark.
“We are grateful to have worked with Cloudera to bring Anaconda to the Cloudera ecosystem,” states Peter Wang, chief technology officer and co-founder of Continuum Analytics. “The integration of Anaconda and Cloudera’s platform allows enterprises to realize the full potential of their data by making it easier to get started and distribute Anaconda across Hadoop clusters to support critical data science workloads.”
Additionally, Cloudera announced its community involvement with the new Apache Arrow project. Together with developers from Amazon, Databricks, Dremio, MapR, Trifacta, and Twitter, Cloudera is developing Arrow as a new in-memory columnar data structure to standardize in-memory processing and interchange across the ecosystem. Its efficient design will also accelerate analytic workloads across Hadoop frameworks (including Impala and Spark), and enable native interoperability for languages like Python and R for better data access and high-performance analytics.
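To make the idea of an in-memory columnar layout concrete, the following is a minimal sketch in plain Python that contrasts row-oriented records with column-oriented buffers. It uses only the standard-library array module and does not use Arrow's actual memory format or API; the sample records, the field names user_id and latency_ms, and the helper to_columns are hypothetical and chosen purely for illustration.

```python
from array import array

# Hypothetical sample data in row-oriented form: one dict per record.
rows = [
    {"user_id": 1, "latency_ms": 12.5},
    {"user_id": 2, "latency_ms": 7.0},
    {"user_id": 3, "latency_ms": 10.5},
]


def to_columns(records):
    """Pivot row records into one contiguous, typed buffer per column.

    Illustrative helper only; this is not how Arrow itself is implemented.
    """
    return {
        # "q": signed integers, 8 bytes each on typical platforms
        "user_id": array("q", (r["user_id"] for r in records)),
        # "d": double-precision floats
        "latency_ms": array("d", (r["latency_ms"] for r in records)),
    }


columns = to_columns(rows)

# An aggregate over a single column scans one contiguous buffer
# and never touches the values of the other fields.
mean_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
print(mean_latency)
```

Because each column is stored as a single typed buffer, an analytic scan reads only the fields it needs, which is the general property that columnar formats rely on for efficient analytics.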
“Cloudera has been paving the way for data scientists and engineers to become more deeply immersed in the Hadoop ecosystem,” said Wes McKinney, software engineer at Cloudera and the creator of Python pandas. “As the technology continues to mature, the vision of Python programmers leveraging the full-scale Hadoop ecosystem for complex data analysis becomes more tangible. We will continue to improve and expand data science capabilities across the platform, including ongoing development to make languages such as Python first-class citizens for the platform.”
These new advancements in making Hadoop more accessible and usable to the data science community are complemented by Cloudera’s recent development and leadership in this area, including:
● Spark MLlib in Cloudera 5.5: In the latest Cloudera Enterprise 5.5 release, Cloudera added Spark MLlib, extending Spark's ease of use and performance gains to machine learning applications within Hadoop. Cloudera also included Spark SQL, which extends the capabilities of Spark for developers and data scientists by allowing SQL to be embedded seamlessly within Spark applications.
● Ibis in Cloudera Labs: As a new open source project incubating in Cloudera Labs, Ibis is aimed at enabling advanced data analysis on a 100 percent Python stack and bringing a native Python experience to Hadoop at scale.
● SparkOnHBase in Cloudera Labs: Originating in Cloudera Labs and now committed to the Apache HBase 2.0 branch, SparkOnHBase provides more flexibility for building analytic applications that rely on Spark Streaming.
● Spark Runner for Apache Beam (incubating) in Cloudera Labs: Originating in Cloudera Labs and now part of the Beam SDK (formerly Google Dataflow), this project helps data scientists more easily build practical, massive-scale data processing pipelines for execution on Spark.
● Apache Spark Training: With unprecedented expertise and experience with Hadoop and its ecosystem, Cloudera brings a real-world approach to training and certifications for data scientists and developers to take full advantage of Spark as part of a complete Hadoop platform.
Enabling data scientists to leverage the full power of the Hadoop ecosystem means opening up new possibilities for enterprises looking to build faster, more intelligent data applications and predictive models that improve customer experiences and drive new revenue streams. Through this ongoing evolution, Cloudera is committed to offering seamless accessibility, productivity, and ease-of-use to the data science community.
This article was originally published by SCO Post and can be viewed in full here

