
This article was originally published by nextplatform.com and can be viewed in full here
There are few more widely recognized names in modern database research than Dr. Joseph Hellerstein. The Berkeley professor and Trifacta co-founder has spawned new approaches to relatively old problems on the programmatic and database design and implementation fronts.
Well before the tech world was awash in tales of “big data” woes, Hellerstein and his teams were looking ahead at future problems of data manipulation, transformation, and visualization. That work culminated in the Wrangler project, which matched data manipulation and visualization tools with several new layers of automation and flexibility. At the time, around 2011, these extended what databases could do, and, just as important, the focus on performance meant they could do it faster and more efficiently.
For anyone who has followed news about the open source Wrangler project that Hellerstein and collaborators from Stanford developed, and how it fed into the startup Trifacta, it will be clear that the work had value outside of research contexts. And if you’ve been following Trifacta beyond the research, it’s likely because the company has raised a striking amount of funding since its launch ($76 million, including last week’s most recent influx of $35 million) and has notable use cases across a large swath of the Fortune 500, with companies like Time Warner, Intel, Thomson Reuters, Dow, Capital One, and many others climbing on board with its approach to data transformation, preparation, and exploration. What is notable is that in an ecosystem so crowded with analytics, visualization, and data staging vendors, Trifacta has managed not only to stand out but to stand alone, and in a relatively short amount of time to boot.
With a start in 2012, the company relied on strategic partnerships with all three of the leading Hadoop distribution vendors, as well as a number of other data source providers and SaaS providers. The research, as noted previously, began by considering the different ways people were interacting with data, and found that almost 80% of their time was spent simply getting data in shape before it could be analyzed. As Hellerstein tells The Next Platform, “as we talked to people in the field, a large percentage of their day was being spent manipulating data so running an analysis, whether it was clustering or machine learning or something simpler, was often prefaced by many hours of transforming the data to get it into shape so it could be plugged into these algorithms. So even then before we started, we were finding the thing that presented the most interesting research problem was also the problem that represented the lion’s share of the workload.”
At the time Trifacta got its start, the main way to tackle the problem of transforming and preparing data was to write programs. Technical and laborious by nature, that approach was also unintuitive. “This approach doesn’t provide the right intuitive feedback while you’re doing it—you’re not seeing what’s happening to the data; you’re manipulating programming statements that don’t align to the real structure and content of it.” The other way is a graphical programming approach, where icons can be connected on a canvas to build a data flow graph. This moves things up to a slightly higher level, which is valuable, but it is still abstract and disconnected from the data and the problems one is trying to solve, Hellerstein says. “With these approaches, you’re not seeing actual data, but a description of what you’ll do with the data. It’s all programming in the abstract. That was the state of the art when we started working on these problems.”
Even with these emerging approaches, existing practices, including spreadsheets, offered some early inspiration. With spreadsheets, the data is right there all the time and can be directly manipulated. This works well at small scale, but it does not translate to larger datasets. Hellerstein and team decided to take a best-of-all-worlds approach and blend that direct manipulation benefit with more specific features and interfacing options that were cropping up in data flow graphs and other areas. The ultimate result of this work was something Trifacta calls “predictive interaction,” which blends these capabilities and moves along with the user, learning from the process under the covers.
The concept is not so unlike Google’s search function, which predicts what you might be looking for in the text entry window. “When you enter anything on screen, whether you’re looking at a table, a bar chart, we’re translating that to cards at the bottom of the screen that contain a visualization of what the outcome might be, along with a rank order list of the operations that might be needed so users can refine them.” While Hellerstein didn’t give much of a look at the machine learning algorithms that underlie this, it is worth mentioning that the problem being targeted, the very lengthy time required for data transformation, can be significantly reduced in this manner, just as the time spent on Google searches is (a saving that becomes noticeable when searches are done in great volume throughout the day).
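To make the idea of a rank-ordered list of suggested operations concrete, here is a minimal Python sketch. It is not Trifacta’s algorithm, which the article does not describe; it simply assumes a user has highlighted a piece of text in one cell, generalizes that selection into a few candidate extraction patterns, and ranks them by how many values in the column they would match. The function names, the candidate patterns, and the coverage score are all illustrative assumptions.

```python
import re


def candidate_patterns(selection):
    """Generalize a highlighted string into candidate regexes (hypothetical rules)."""
    candidates = [("literal text", re.escape(selection))]
    if selection.isdigit():
        candidates.append((f"exactly {len(selection)} digits", rf"\d{{{len(selection)}}}"))
        candidates.append(("any run of digits", r"\d+"))
    elif selection.isalpha():
        candidates.append(("any run of letters", r"[A-Za-z]+"))
    return candidates


def rank_suggestions(column_values, selection):
    """Rank candidate extract operations by the share of rows they match."""
    ranked = []
    for label, pattern in candidate_patterns(selection):
        regex = re.compile(pattern)
        matches = []
        for value in column_values:
            found = regex.search(value)
            matches.append(found.group(0) if found else None)
        coverage = sum(m is not None for m in matches) / len(column_values)
        ranked.append({
            "operation": f"extract {label}",
            "pattern": pattern,
            "coverage": coverage,
            "preview": matches[:3],  # small preview, like a suggestion card
        })
    # Highest coverage first; the user stays free to pick a lower-ranked card.
    ranked.sort(key=lambda s: s["coverage"], reverse=True)
    return ranked


values = ["order 2015-03", "order 2014-11", "ref 987"]
for suggestion in rank_suggestions(values, "2015"):
    print(suggestion["operation"], suggestion["coverage"], suggestion["preview"])
```

The ranking here is deliberately crude. The point is only that each suggestion carries a preview of its outcome, so the person can confirm or refine it rather than trust it blindly.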
These first functions have since been honed in Trifacta’s offerings, so that users can highlight features in their data, get suggestions, and move forward faster. There is no automatic algorithm to magically transform and clean data, but this approach can be considered as “having the algorithms participate” in the process by speeding it along in a visually interactive mode.
In practice, one might receive a massive file of data that looks, in the raw, like a bunch of garbled text. The manual challenge would have been to find out where this fits into rows and columns—what the structure of that data might be. However, with the predictive weight of a pattern recognition algorithm offered as a suggestion, the system can layer on the patterns, refining that data into where the natural rows and columns are, then how further patterns in even one row or column should fit together. It is difficult to express how much time this might save, particularly at the beginning stages of being handed a raw file of nonsense and making it all fit together.
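As a rough illustration of finding “where the natural rows and columns are,” the following Python sketch scores a handful of candidate delimiters by how consistently each one splits the lines of a raw file. This is an assumed, simplified heuristic for the sake of example, not the pattern recognition Trifacta uses; the candidate delimiter list and the function names are hypothetical.

```python
from collections import Counter

# Assumed candidate set; a real tool would consider many more patterns.
CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]


def score_delimiter(lines, delimiter):
    """Return (consistency, columns) for one candidate delimiter.

    consistency is the share of lines that contain the most common
    number of delimiter occurrences; columns is that count plus one.
    """
    counts = [line.count(delimiter) for line in lines]
    most_common_count, frequency = Counter(counts).most_common(1)[0]
    if most_common_count == 0:
        # The delimiter is absent from most lines, so it suggests no structure.
        return 0.0, 1
    return frequency / len(lines), most_common_count + 1


def suggest_structures(raw_text):
    """Return candidate row/column structures, most consistent first."""
    lines = [line for line in raw_text.splitlines() if line.strip()]
    suggestions = []
    for delimiter in CANDIDATE_DELIMITERS:
        consistency, columns = score_delimiter(lines, delimiter)
        if consistency > 0:
            suggestions.append({
                "delimiter": delimiter,
                "columns": columns,
                "consistency": consistency,
            })
    suggestions.sort(key=lambda s: (s["consistency"], s["columns"]), reverse=True)
    return suggestions


raw = "id|name|city\n1|Ann, Jr.|Oslo\n2|Bob|Rome\n3|Cy|Lima"
print(suggest_structures(raw))
```

In the sample, the comma appears in only one line and is discarded, while the pipe splits every line into three fields and becomes the top suggestion. The result is still only a suggestion for a person to accept or correct.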
The suggestions are not always correct the first time, Hellerstein says, pointing again to the algorithm as a “participant” rather than an automatic tool for data transformation, but there are many suggestions and algorithmic techniques that can draw on machine learning to help users get to where they need to be. With the human and computer interacting more closely, and with the human having ultimate control and understanding of the patterns being found and their relationship to the problem, the time-consuming task of data transformation can be cut down dramatically.
Ease of use is another feature that Trifacta is trumpeting, especially as its user base has grown. Putting that kind of analytical power into the hands of an ever-growing set of non-specialists and programmers means the ability to discover connections is more democratized. And after all, isn’t data democratization part of what this whole “big data” craze has been about?
For a company like Lockheed Martin, which has been lending its computer science skills to the Centers for Medicare and Medicaid Services to combat fraud, sifting through claims data meant analyst teams needed to standardize on a platform and then work through several disparate data types. The transformation process, according to Trifacta, initially took six weeks, a time they were able to reduce to a day. Storage and technology vendor EMC was stuck building complex scripts to prepare performance, maintenance, and other data from its products at customer sites for analysis. Teams there were burdened by this challenge, both in terms of people and time. PepsiCo, LinkedIn, RBS, and a number of others, which had been doing transformation using either custom scripts or other manual approaches, have moved to Trifacta, and those success stories appear to be paying off. The company is adding more people around the world and, armed with the additional funding, is set to be one of the great data-driven success stories of the year, if it wasn’t one already in 2015.

