
“Big data” means different things to different people. Ask the average consumer and they’ll probably say it’s something to do with the cloud. Ask a business owner and you’ll hear about detailed information that can be combed for ways to increase profits. Ask an IT professional and you’ll likely get into the nitty-gritty of the newest, fastest ways to process massive amounts of data.
For Cloudera Inc., a data company out of Austin, TX, big data isn’t just about profits: it’s also valuable for public health projects like fighting the Zika virus. The company recently hosted a hackathon in partnership with the University of Texas to find uses for data generated by the current outbreak, and in just one day participants developed several ways to model and predict Zika’s spread.
Over 50 local data scientists, programmers, and tech experts came together to scrape data from the CDC, WHO, and ECDC, with some of the results looking quite promising. One programmer used TensorFlow machine learning and satellite images to find standing pools of water, while others focused on designing an app that crowdsources data from those experiencing symptoms. The app would automatically geolocate new cases and track the spread of the disease, enabling health workers to target an area before the outbreak is serious.
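To make the crowdsourcing idea concrete, here is a minimal sketch of how geolocated symptom reports might be grouped into map cells so that clusters stand out. It is not the hackathon team's code: the `Report` type, the `hotspots` function, the 0.1-degree cell size, and the three-report threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Report(NamedTuple):
    """One self-reported case (hypothetical structure)."""
    lat: float
    lon: float


def grid_cell(report: Report, cell_deg: float = 0.1) -> Tuple[int, int]:
    """Map a report to the index of the grid cell that contains it."""
    return (math.floor(report.lat / cell_deg), math.floor(report.lon / cell_deg))


def hotspots(
    reports: Iterable[Report], cell_deg: float = 0.1, min_reports: int = 3
) -> Dict[Tuple[int, int], int]:
    """Return the cells that have at least `min_reports` reports, with counts."""
    counts = Counter(grid_cell(r, cell_deg) for r in reports)
    return {cell: n for cell, n in counts.items() if n >= min_reports}


# Illustrative coordinates only, not real case data.
sample = [
    Report(30.27, -97.74),
    Report(30.28, -97.75),
    Report(30.29, -97.73),
    Report(29.76, -95.37),
]
flagged = hotspots(sample)
```

A real app would also need to handle duplicate reports, privacy, and uneven reporting rates between neighborhoods, none of which this sketch attempts.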
Attendees also learned a lot about how companies like Cloudera crunch data and how to incorporate those methods into public health research. Eddie Garcia, Cloudera’s chief security architect, said that a cure for Zika wasn’t the goal of the hackathon. “[We want] to build awareness around the disease,” he said, as well as to “highlight the challenges to find and create data sets for research and the socialization of open data sets for social good.” Garcia and those who worked on the hackathon want to keep the project going, and they envision this as only the first of many events that find socially valuable ways to use data.
Are hackathons the ideal format?
Sitting in a room full of tech professionals who share one goal is invigorating, but Miriam Young, communications director at DataKind, says hackathons aren’t the most effective way to get things done. “A lot of great ideas come out of hackathons,” she said, “but those ideas rarely lead to a useable product.” As hackathons grow ever more popular, there are problems that need to be addressed, and addressing them is exactly what DataKind aims to do.
As opposed to a one-off, loosely organized event, DataKind focuses on long-term projects that it calls data dives. “We partner directly with the organizations we’re helping so that we can work with them, not for them,” Young said. DataKind staff and project volunteers often spend months researching and collating data before a weekend data dive, and the end result for DataKind has been a multitude of permanent projects that have made a real difference.
DataKind data dives have produced human rights alert filtering systems for Amnesty International, a triage system for Crisis Text Line, and even mobile device software for Nexleaf that prevents vaccine spoilage and maximizes effectiveness. “Big data can be an amazing resource for public health. The only problem is getting the biggest benefit out of the massive amount of data an organization might have,” Young said.
Collaboration is the name of the game as far as DataKind founder Jake Porway is concerned. “Without subject matter experts available to articulate problems in advance, you get results … [that] solve the participants’ problems,” he wrote in the Harvard Business Review.
“As data scientists, we are well equipped to explain the ‘what’ of data, but rarely should we touch the question of ‘why’ on matters we are not experts in,” Porway added. Hackathons, he said, are often a free-for-all that simply doesn’t address the real needs of the host organization. DataKind’s team is in constant communication with subject matter experts, and Porway doesn’t think it can work any other way.
It isn’t just the hackathon format that causes problems, either: it’s also the data itself. Whether it is gathered from the CDC, Google search results, or any other source, there’s an inherent problem that Porway also wants to call attention to: there is no such thing as “raw data.”
Cloudera data scientist Juliet Hougland agrees with DataKind, which is in large part why the Zika virus hackathon is the first in a series. “We partnered with the Golden Gate National Parks Conservancy (GGNPC) to track the reintroduction of local plant species, and there’s one big reason we succeeded: there was a member of their team at the hackathon with us.”
To Hougland and the Cloudera team, there’s simply too much data out there to dive in without guidance from someone who knows the material. In the case of the Zika hackathon, multiple events and a close partnership with the University of Texas at Austin are how they’ll create results. “It takes time to merge datasets and find relations,” Hougland said, “which is why we plan on using UT Austin’s computing resources to continue analysis well after our hackathons are over.”
Where big data falls short
Most people who follow news about big data are familiar with the Google Flu Trends failure in 2013. Google tried to use search data to predict flu rates in advance of flu season, but the end result was anything but accurate. Google’s estimates ended up at more than double the CDC’s numbers, and 2013 wasn’t the first time: it had the same problem in 2011 and 2012 as well.
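A check of this kind is simple to automate. The sketch below compares a model's estimates with officially reported figures and flags the periods where the two differ by a factor of two or more. The function names and the numbers are made up for illustration; they are not Google's or the CDC's data.

```python
from typing import List, Sequence


def estimate_ratios(estimated: Sequence[float], reported: Sequence[float]) -> List[float]:
    """Ratio of each estimate to the matching reported value."""
    if len(estimated) != len(reported):
        raise ValueError("estimated and reported must have the same length")
    if any(r <= 0 for r in reported):
        raise ValueError("reported values must be positive")
    return [e / r for e, r in zip(estimated, reported)]


def periods_off_by(
    estimated: Sequence[float], reported: Sequence[float], factor: float = 2.0
) -> List[int]:
    """Indexes of periods where the estimate is off by `factor` in either direction."""
    ratios = estimate_ratios(estimated, reported)
    return [i for i, q in enumerate(ratios) if q >= factor or q <= 1.0 / factor]


# Hypothetical weekly rates (percent of doctor visits), for illustration only.
model_estimates = [4.0, 6.0, 10.5]
official_rates = [4.0, 4.0, 5.0]
suspect_weeks = periods_off_by(model_estimates, official_rates)
```

Running a comparison like this every period, rather than once a season, is one way to notice that a model has drifted before the gap becomes a headline.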
It’s hard to pinpoint just why Google Flu Trends failed, but the possibilities aren’t unique. Data-gathering algorithms change, participants can manipulate the data that’s gathered, and biases in both the participants and the data organization can affect what is considered valuable.
Data scientist Cathy O’Neil said that, quite simply, humans place too much faith in algorithms. “Algorithms are just as biased as human curators,” she said in a recent blog post. She also said that we often trust those algorithms more than people, despite the fact that they are created by human programmers. It’s here that big data, in almost any form, starts to fall short.
What this means for your organization
If you’re collecting, sorting, or using big data, there’s a lot to consider. Keep the following points in mind so you don’t make Google Flu Trends-level mistakes with your data projects:
- Collaboration that gets results won’t always happen in one weekend. Your subject matter experts need to work with programmers until your data turns into real, usable information.
- Biases are everywhere, so don’t assume your raw data is just data: it’s all touched by people at some point. You’ll get much better results by being transparent about every step of the data gathering process. Algorithms, methods, and even the questions being asked all color your data.
- If you want to host a hackathon, don’t expect big results to come out of a single event. Cloudera’s event is expected to be just one of many, and yours should be too.
Big data has the potential to save lives and change the way we live, but as with anything else, scrutiny is essential. As Jake Porway said, data isn’t just numbers: it’s the quantification of our world. If you want to capture the beauty of big data, you’re going to need to commit.
This article was originally published on www.techrepublic.com and can be viewed in full

