
Take a good look at Spark, don’t overcopy data and, for goodness’ sake, don’t use Microsoft Excel. This was just some of the advice offered by Janis Landry-Lane, worldwide software-defined life sciences industry lead at IBM, as she delivered her headline session at the Computing IT Leaders Forum.
Speaking at the ‘Learning from the Leaders in Big Data’ event in London, Landry-Lane spelt out her key advice for big data practitioners based on 15 years of experience in high-performance computing, a field that led directly to the big data revolution. She is now spearheading big data projects that are helping sequence genomes and cure diseases.
1. Effective proof-of-concept periods lead to speed and provenance
“Our role is to say ‘Let’s design the proof-of-concept [PoC] so then when it’s wildly successful, you can get the answer today – not tomorrow, not the next day’,” said Landry-Lane. It’s a complex model indeed, which needs to take a huge array of factors into account.
“Speed is key, data provenance is key. Provenance means – in healthcare – if they’re going to sequence your genome and give you a drug based on your genome, and it’s the right drug, and that it’s your genome, and the data has not been touched and that it’s been processed with the latest known algorithms.
“These are the types of things we have to pay attention to.”
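The provenance checks described above (the right genome, untouched data, the latest algorithms) can be illustrated with a minimal Python sketch. It is not IBM's method: the record fields, the function names and the idea of a single pipeline version string are assumptions made purely for illustration.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large sequence files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record stored alongside a sequence file."""
    sample_id: str          # whose genome this is
    pipeline_version: str   # which algorithm release processed it
    sha256: str             # fingerprint of the file when it was recorded


def make_record(path: Path, sample_id: str, pipeline_version: str) -> ProvenanceRecord:
    """Capture the file's fingerprint together with its owner and pipeline version."""
    return ProvenanceRecord(sample_id, pipeline_version, sha256_of(path))


def verify(path: Path, record: ProvenanceRecord,
           sample_id: str, current_pipeline_version: str) -> bool:
    """True only if the sample matches, the pipeline is current and the bytes are unchanged."""
    return (
        record.sample_id == sample_id
        and record.pipeline_version == current_pipeline_version
        and record.sha256 == sha256_of(path)
    )
```

A caller would create the record once when the sequence is written, then run `verify` before any downstream decision is made from the file.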
2. Don’t use Excel
Landry-Lane explained how one of her biggest takeaways from “installing some of the very largest installations in the world” was relating scaling to big data, and “making it better, and stopping people falling off the edge”.
She defined ‘falling off the edge’ as, basically, falling back on spreadsheets to attempt big data analysis, which remains a surprisingly common practice among fledgling or simply lazy big data projects.
3. Use Spark, but maybe not Hadoop
“I love the fact that Spark is gaining momentum,” said Landry-Lane about the Apache open source cluster framework. On the other hand, “Hadoop is way too complex – it’s slow and it can never do what we needed it to do,” she advised.
Landry-Lane is also a fan of “build, ship and run” automated deployment platform Docker. “I deal a lot in other industries with Docker containers – we have a big issue with reproducibility and a big issue of metadata,” she said.
4. Keep a single dataset, and manage it properly
“We’ve seen users doing a lot of their own administration – they were moving data, copying data, analysing data,” explained Landry-Lane.
“But big data is too big to copy, move and have many copies of. You really can get down to one copy – you can get to something called a global namespace, where everything shares a global file system.
“You do not have to have it copied over to Spark or Hadoop and then reload into MongoDB or whatever – you don’t need to do this. There is technology – which has been out there since 1998 – so why should you do this? It just hasn’t come to all industries yet from out of high performance computing.”
Landry-Lane was referring to “IBM General Parallel File System” – an “IBM asset, and one of the finest we have – but the new folks will call it Spectrum Scaler,” she quipped, recommending that software-defined storage for high performance workloads can lead to greater scalability with just one dataset.
5. Think also about software-defined storage simply because data is getting massive
A typical genome sequence weighs in at 600GB, and will still be 200GB when compressed, said Landry-Lane. The figure illustrates how data in general, not just medical data, keeps getting bigger.
“So let’s use software to manage the data – you can’t afford to keep data on spinning disk, you can’t afford to keep it in object store, but you may need to keep it,” she explained.
“We need to keep it potentially for the life of the patient, and have it sequenced many times, and analysed many times.
“The typical sequence size is 600GB. We’re going to start keeping that kind of data on individuals, and that will change a lot, because your DNA sequences change.”
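To put the quoted figures in perspective, the short Python sketch below works through the arithmetic. Only the 600GB and 200GB sizes come from the talk; the cohort size, the number of sequencing runs per patient and the function name are hypothetical values chosen for illustration.

```python
RAW_GB = 600         # typical genome sequence size quoted in the talk
COMPRESSED_GB = 200  # size after compression quoted in the talk


def cohort_storage_tb(patients: int, gb_per_genome: int, runs_per_patient: int = 1) -> float:
    """Total storage in terabytes (1 TB = 1000 GB) for a cohort of sequenced patients."""
    return patients * runs_per_patient * gb_per_genome / 1000


# Compression ratio implied by the two quoted figures.
compression_ratio = RAW_GB / COMPRESSED_GB

# Hypothetical cohort: 10,000 patients, each sequenced once.
raw_tb = cohort_storage_tb(10_000, RAW_GB)
compressed_tb = cohort_storage_tb(10_000, COMPRESSED_GB)

print(f"compression ratio: {compression_ratio:.0f}:1")
print(f"raw: {raw_tb:.0f} TB, compressed: {compressed_tb:.0f} TB")
```

Under these assumptions, a single sequencing pass for the cohort already occupies thousands of terabytes, and resequencing each patient multiplies the total accordingly.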
This article was originally published on www.computing.com and can be viewed in full

