
In this special guest feature, Diana Shealy, a data scientist at Treasure Data, highlights several commonly overlooked best practices when building out an analytics program in any organization, outlines common data-science problems and offers solutions to help you make wiser choices based on your company’s specialized needs. Diana is a data scientist who believes that insights from data are accessible to every company. After focusing her studies on both statistics and computer science, she has enjoyed successfully growing the role of data science at two enterprise software start-ups.
Introduction
With all the hype around data science and ubiquitous terms like “Big Data,” it’s easy to fall into a pernicious trap: Following ideas about how things “should” be done, rather than doing things that will best benefit your company. In my past four years as a practicing data scientist, I have noticed several commonly overlooked best practices when building out an analytics program in any organization. Below, I outline common data-science problems and offer solutions to help you make wiser choices based on your company’s specialized needs.
Treating Analytics as a Necessity
One common mentality considers analytics and other data initiatives as nice-to-have resources – something to address only after the company has reached some measure of stability. Though companies love to tout themselves as being data-driven in their decision-making, under this paradigm analytics usually comes too late to inform the decisions that matter most.
Organizations should treat analytics initiatives as essential to both the company’s and the product’s success. Startup CTOs and technical product owners consider it bad practice to delay a continuous testing infrastructure for a new project, and an analytics infrastructure should be no different. Start early and build your engineering and product foundation on data, rather than on hunches and best guesses.
Find a Data Champion
While some forward-thinking companies establish analytics infrastructure or data organizations early, the majority of companies continue to follow the antiquated model of data silos. The marketing team owns marketing data, the engineering team owns server and production data, and so on. This older style of data management seems easier to manage, at first. Each group only has to deal with a limited piece of the puzzle. However, as they stack up, these information silos begin to create unnecessary complexity. Instead of one team understanding all of the nuances of the company’s data, the company has a variety of teams with varying agendas and different degrees of data fluency being forced to work together. Under this scheme, data analysis often falls to the bottom of the priority pile.
In small or young companies, a full-fledged data organization is not essential to gain the benefits of an analytics initiative. But you do need a data champion. Data champions “own” all the data. They define what will be collected, ensure the collection pipeline is working, monitor data quality and champion the use of data wherever possible. Data champions don’t necessarily need to perform analysis, but they are the keepers of the data.
Larger organizations are lucky: they can build true data or analytics organizations. Yet I still see many larger companies sticking with outdated silos. In addition, many of these data and analytics organizations get leveraged only for analysis, when they should own the entire pipeline: collection, storage, clean-up, analysis and visualization. Why have engineering or marketing talent waste precious time collecting and organizing data, when one organization can do it all?
Define Questions First, Then Collect Data
It’s easy to get excited when it’s time to start collecting data, but sometimes the thrill leads to the neglect of important steps. Deciding what data to collect is often left to the engineering department, which generally collects what it needs in the near term, without consideration for future questions or data that will be useful in the weeks and years ahead. As a result, when it’s time to do any analysis, the data is riddled with gaps. This forces companies to return to step one, and requires a pause in analysis while new data is accumulated. The frustration caused by the wasted time and talent overshadows the benefits data brings to organizations.
The solution is simple, but sadly it is rarely implemented. Define a full range of questions first, and then build processes to collect the data necessary to answer those questions. Treat data collection like the scientific method: create a hypothesis, collect data, analyze it and review your conclusions. Then rinse and repeat. Data collection should never be finished; it will continually evolve as your needs change.
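One lightweight way to put "questions first" into practice is to write each question down next to the fields needed to answer it, then compare that list against what is actually being collected. The following is only a minimal sketch: the questions, the field names and the `find_gaps` helper are hypothetical, chosen for illustration rather than taken from any particular product or pipeline.

```python
# Hypothetical questions and field names, for illustration only.
# Each question maps to the set of fields needed to answer it.
QUESTIONS = {
    "Which signup channel retains users best?": {
        "user_id", "signup_channel", "last_active_at",
    },
    "How long does onboarding take?": {
        "user_id", "signup_at", "onboarding_completed_at",
    },
}

# Fields the (hypothetical) collection pipeline captures today.
COLLECTED_FIELDS = {"user_id", "signup_at", "last_active_at"}


def find_gaps(questions, collected):
    """Return, per question, the required fields that are not yet collected."""
    gaps = {}
    for question, required in questions.items():
        missing = required - collected
        if missing:
            gaps[question] = sorted(missing)
    return gaps


if __name__ == "__main__":
    for question, missing in find_gaps(QUESTIONS, COLLECTED_FIELDS).items():
        print(f"{question} -> missing: {', '.join(missing)}")
```

Rerunning a check like this whenever a new question is added surfaces gaps before the analysis starts, rather than after.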
Start Small but Think about Scalability
Far too often companies start too big with their analytics infrastructure and believe they need to immediately include everything, including the kitchen sink. So they rush to build a massive and highly specialized data pipeline that is costly to run and maintain. As the first insights come in, the company recognizes the value of data analysis, but the expense is too high.
Tackling too much at once leads to sloppiness and overspending. When beginning, choose a few use cases and start building out an infrastructure that’s flexible for your needs with reasonable scalability. Chances are, your analytics needs are going to iterate rapidly once your organization develops an appetite for data, so start small and use incremental building blocks. Don’t go for the newest technology just because it’s the cool new thing. Once your initial pipeline is running smoothly, it will be easier to add and subtract pieces incrementally.
Control Your Own Data
On the other hand, when some companies start thinking about data analytics, they make the right decision and start small. But then they go overboard, and the initiative is too small. They use products and services that hold their data hostage, or are cheap or even free as long as the data volume is limited. As data volume increases, they face a trade-off between exponentially increasing costs and limiting their analytics to massively restricted datasets. Imagine what kinds of trends get missed because a third-party vendor won’t allow full access to data.
Having full control of data helps companies get the most out of analytics experiences. The majority of analytics use cases quickly outgrow the out-of-the-box solutions, and companies are left back at square one: building the analytics infrastructure they were trying to avoid in the first place. Meanwhile, valuable time and data have been wasted. Avoid the easy temptation that some of these vendors offer and focus on services and products that give you full control of your data from day one.
Recognize the Data You Already Have
Remember those data silos? Good. Many sources of data are overlooked, like CRM, HCM and other enterprise application data. Intelligently defined subsets of that data should be integrated with your data warehouse or storage layer so that analysts and data scientists can easily access them without having to jump through hoops.
As companies build out data infrastructures, they should keep in mind the other sources of data they will want to collect, besides the obvious sources. Find overlooked data and pull it together into one place. This allows data champions to merge, morph and utilize everything they can, allowing organizations to optimize the value of their data initiatives.
Don’t Just Set It and Forget It
Congratulations! You have built a data analytics pipeline and data is happily streaming to a single source of truth. Give yourself a hand! You have accomplished something even top Fortune 500 companies struggle to achieve. But there is a problem: You don’t have the resources to analyze it.
I’m always amazed when I hear about organizations that want to build analytics infrastructure before hiring actual analysts. This is putting the cart before the horse. Why pay to collect and store data if you are not going to immediately start analyzing it? For startups, where everything changes at lightning speed, having massive amounts of historical data sitting and collecting dust is a waste of money and resources. Analysts and data scientists need to be part of the budget from the beginning. This is vital. A data initiative needs to be about the analysis and how that aids the organization rather than about the data. Collecting data and letting it sit is equivalent to letting money accumulate in a bank account that earns no interest.
It’s hard not to fall into the hype-trap of Big Data and analytics, for good reason — when data analytics is done correctly, it brings tremendous value to growing organizations. But doing it correctly makes all the difference — it’s what lets the reality live up to the hype. Being able to recognize and solve the problems I have outlined above will allow you to create a data analytics pipeline that makes the most of your valuable time, resources and talent.
This article was originally published on insidebigdata.com, where it can be viewed in full.

