Monday, 13 June 2016

What Data Science does, can do and will do for scientific discoveries

So far, we have shown that Data Science is not only for business, marketing and advertising, but also has practical uses in many fields of science, ranging from Astronomy to Physics and beyond. But the examples presented are just the beginning.
We have established how Data Science can be useful in the pursuit of scientific discoveries. Now it is time to review the vanguard of the industry: where it is headed, what the future holds and what roadblocks might stand in the way.

Deep learning the future

The concept of deep learning is based on neural networks, which have been around for a long time (Roberts, 2014), but it has gathered speed in recent years, with corporations and universities investing in multi-layer neural networks intended for a wide variety of uses.

We need not look far for an example: the widely discussed Google DeepMind project has produced results such as human-level control through deep reinforcement learning, among many other projects (Google Deepmind, 2011).

Figure 1 – Example of how deep learning works for face recognition (Mayer, 2015)

But perhaps the most impressive and best-known feat achieved by the project is the development of a deep learning program that learned the complex game of Go and defeated top Go player Lee Sedol in a five-game match (Gibney, 2016). This accomplishment shows how far Deep Learning has advanced, since experts had said that a computer would never beat a human player (Cho, 2016).
That is why deep learning is being tried out for cell classification (Chen, et al., 2016), chemical mapping, X-ray scattering image classification and more (Brookhaven National Laboratory, 2015). Major universities and research centers are investing in deep learning too, such as NERSC and Berkeley joining forces to test the technology's capacity for breakthroughs in health and medicine (Kincade, 2015).

Data Science as an aid to broadening human knowledge

With science advancing in giant leaps in several fields, and instruments getting more powerful and sophisticated, the amount of data to process keeps growing. That is where data science comes onto the scene.
The detection of gravitational waves was one of the biggest headlines in scientific discovery in recent months (Overbye, 2016), confirming a 100-year-old prediction by Einstein. The Laser Interferometer Gravitational-Wave Observatory (LIGO) received a particularly strong signal that confirmed the theory, a feat that had proved elusive because such signals are so hard to tell apart from noise. Data Science could make this separation easier by processing the enormous amount of data produced by the equipment and finding the underlying evidence (Yuan, 2016).

Figure 2 – Consistent signals detected in LIGO sites located 2000 miles apart (Circus Bazaar, 2016)
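
To make the idea of pulling a signal out of noise concrete, here is a minimal sketch of one common technique, a matched filter, which slides a known waveform along the data and looks for the offset where the two agree best. Everything in it is an assumption made for illustration: the chirp shape, the noise level, the sample counts and the function names are hypothetical, and it is not a description of the actual LIGO analysis.

```python
import math
import random

def make_chirp(n_samples, f_start, f_end):
    """Return a sine wave whose frequency rises linearly (in cycles per sample)."""
    rate = (f_end - f_start) / n_samples
    return [math.sin(2 * math.pi * (f_start * t + 0.5 * rate * t * t))
            for t in range(n_samples)]

def matched_filter(data, template):
    """Correlate the template with the data at every possible offset."""
    n = len(template)
    return [sum(data[offset + i] * template[i] for i in range(n))
            for offset in range(len(data) - n + 1)]

def find_signal(data, template):
    """Return the offset with the highest correlation score, and that score."""
    scores = matched_filter(data, template)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical experiment: hide a chirp inside Gaussian noise.
rng = random.Random(42)                       # fixed seed for repeatability
template = make_chirp(200, 0.01, 0.2)         # assumed waveform shape
true_offset = 500                             # where the signal is hidden
data = [rng.gauss(0.0, 0.5) for _ in range(1000)]
for i, value in enumerate(template):
    data[true_offset + i] += value

found_offset, score = find_signal(data, template)
```

Comparing `found_offset` with `true_offset` shows whether the filter recovered the hidden waveform. The design choice is that the filter needs a good guess of what the signal looks like beforehand, which is why having a theoretical prediction matters so much for this kind of search.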

It is also worth mentioning how Data Science could help Astronomy. As telescopes get more complex and more sensitive to light, the amount of data gathered is growing beyond what people can manage by hand. That is why several projects are using data mining to recognize celestial bodies and try to keep up with the data being produced (Galaxy Zoo, 2016).

But it is not a paved road ahead

As science advances, so does the fear that humans will be replaced by robots. With predictions of computers with advanced neural networks replacing entry-level lawyers (Kravets, 2015), and IBM Watson learning from hospital case histories which diagnoses and treatments to recommend (Cohn, 2013), there is concern about how the advancements of Data Science are going to affect the rest of the population.

Figure 3 – Example of IBM Watson’s healthcare capabilities (Saxena, 2012)

Also, for data science to thrive, it needs data. Because scientific papers, research and publications are so difficult or expensive to get hold of (The Cost of Knowledge, 2012), obtaining the raw data or sources needed to discover something novel is sometimes something of a utopia, with publishing companies charging enormous amounts for a glimpse of their material (Elbakyan, 2015).

Data Science has no bounds yet

While there are still titanic challenges in the sciences that Data Science has yet to conquer, breakthroughs are being made every day in the effort to overcome shortcomings and achieve a better understanding in several fields of science (Prabhat, 2015).
So the future looks bright for Data Science, with a significant increase in demand for experts in the field (Islam, 2015), a number of companies getting into the game and Data Science becoming an active participant in scientific discoveries. It remains to be seen how bright it can be (NeRSC, 2015).

Bibliography

Brookhaven National Laboratory, 2015. Deep Learning for Analysis of Materials Science Data. [Online]
Available at: https://www.bnl.gov/compsci/projects/deep-learning.php
[Accessed June 2016].
Chen, C. L. et al., 2016. Deep Learning in Label-free Cell Classification. [Online]
Available at: http://www.nature.com/articles/srep21471
[Accessed June 2016].
Cho, A., 2016. Computer that mimics human brain beats professional at game of Go. [Online]
Available at: http://www.sciencemag.org/news/2016/01/huge-leap-forward-computer-mimics-human-brain-beats-professional-game-go
[Accessed June 2016].
Circus Bazaar, 2016. On Einstein's Gravitational Waves: The Paper. [Online]
Available at: http://www.circusbazaar.com/on-einsteins-gravitational-waves-the-paper/
[Accessed June 2016].
Cohn, J., 2013. The Robot Will See You Now. The Atlantic, March 2013 issue.
Elbakyan, A., 2015. Sci-Hub Reply. [Online]
Available at: https://torrentfreak.com/images/sci-hub-reply.pdf
[Accessed June 2016].
Galaxy Zoo, 2016. Galaxy Zoo. [Online]
Available at: https://www.galaxyzoo.org/
[Accessed June 2016].
Gibney, E., 2016. Google AI algorithm masters ancient game of Go. [Online]
Available at: http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
[Accessed June 2016].
Google Deepmind, 2011. Publications. [Online]
Available at: https://deepmind.com/publications
[Accessed June 2016].
Islam, M., 2015. Future of Data Science and Data Scientists. [Online]
Available at: https://www.linkedin.com/pulse/future-data-science-scientist-mohammad-islam
[Accessed June 2016].
Kincade, K., 2015. NERSC, Berkeley Lab Explore Frontiers of Deep Learning for Science. [Online]
Available at: http://www.nersc.gov/news-publications/nersc-news/science-news/2015/nersc-berkeley-lab-explore-frontiers-of-deep-learning-for-science/
[Accessed June 2016].
Kravets, D., 2015. Law firm bosses envision Watson-type computers replacing young lawyers. [Online]
Available at: http://arstechnica.com/tech-policy/2015/10/law-firm-bosses-envision-watson-type-computers-replacing-young-lawyers/
[Accessed June 2016].
Mayer, R., 2015. Deep Learning Smarts Up Your Smart Phone. [Online]
Available at: http://www.amax.com/blog/wp-content/uploads/2015/12/blog_deeplearning3.jpg
[Accessed June 2016].
NeRSC, 2015. Berkeley Lab Climate Software Honored for Pattern Recognition Advances. [Online]
Available at: https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2015/berkeley-lab-climate-software-honored-for-pattern-recognition-advances/
[Accessed June 2016].
Overbye, D., 2016. Gravitational Waves Detected, Confirming Einstein’s Theory. [Online]
Available at: http://mobile.nytimes.com/2016/02/12/science/ligo-gravitational-waves-black-holes-einstein.html
[Accessed June 2016].
Prabhat, 2015. Big science problems, big data solutions. [Online]
Available at: https://www.oreilly.com/ideas/big-science-problems-big-data-solutions
[Accessed June 2016].
Roberts, E., 2014. Neural Networks History: The 1940's to the 1970's. [Online]
Available at: https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history1.html
[Accessed June 2016].
Saxena, M., 2012. Putting IBM Watson to Work. [Online]
Available at: http://www.slideshare.net/manojsaxena2/putting-ibm-watson-to-work-saxena
[Accessed June 2016].
The Cost of Knowledge, 2012. Statement of Purpose. [Online]
Available at: https://gowers.files.wordpress.com/2012/02/elsevierstatementfinal.pdf
[Accessed June 2016].
Yuan, M., 2016. Gravitational Wave Ushers in a New Wave of Data Science. [Online]
Available at: https://austinstartups.com/gravitational-wave-ushers-in-a-new-wave-of-data-science-928d620d727#.8x5fq0g9i
[Accessed June 2016].


Monday, 6 June 2016

Pushing The Boundaries Of Discoveries With Data Science


So far we have covered what data science is, how it is helping in scientific discovery and who the active players are. Now it is important to understand the latest tools available for reaching diverse Data Science goals, the limitations these tools currently face and the availability of better solutions.

The field of Data Science attempts to create value from information by passing it through four principal stages, namely data preparation, data analysis, data reflection and data dissemination. This process has opened the doors to accelerating discoveries but, as the saying goes, “nothing comes easy in life.”



Figure 1 Data Science Workflow (Guo, 2016)

With the world getting more connected every day, information is growing vast and complex. Alex Szalay, an astrophysicist at Johns Hopkins University, says “How to make sense of all these data? People should be worried about how we train the next generation, not just of scientists, but people in government and industry” (The Economist, 2010). Today's information demands that companies and research institutes come up with cost-effective, cutting-edge technologies for enhanced insight and decision making.

Let’s have a look at a few of the challenges information brings to the field of data science, and a few of the techniques researchers use. These challenges are often described in terms of the 5 V’s, namely Volume, Variety, Velocity, Veracity and Value.




Figure 2 The 5 V's of Big Data (Sweetlysocial.net, 2014)

Cluster processing to the rescue


To start with, the ever-expanding volume and variety of information demand solid infrastructure in which large-scale datasets can be stored for analysis. Hadoop is one of the tools emerging as an efficient framework for storing big data. It is an open source platform that implements MapReduce, a programming model introduced by Google, to process large datasets by splitting the work into small independent tasks.
In addition, Hadoop provides the Hadoop Distributed File System (HDFS), which allows parallel processing by spreading the data over different nodes. In May 2009, Hadoop set world records by sorting a petabyte of data in 16.25 hours and 1 TB of data in 62 seconds (Rosenberg, 2009). Hadoop has enormous potential for making medical discoveries and hence is being used by several large genomics and medical projects. However, a bigger challenge than storing data is processing it in a timely manner. Hadoop is constrained by its disk I/O and by the advanced programming skills it requires of developers.
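
For readers who have not met the MapReduce model before, the sketch below walks through its three phases (map, shuffle, reduce) on a word count, the usual introductory example. It is plain single-machine Python written for illustration only: the function names and the two text chunks are hypothetical and none of this is Hadoop's own API. In a real cluster each chunk would live on a different node and the map calls would run in parallel.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Group the emitted values by key, across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    """Run the three phases in order; each chunk is mapped independently."""
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))

# Two hypothetical chunks standing in for blocks stored on different nodes.
chunks = ["big data big science", "data science"]
counts = word_count(chunks)
```

Because each call to `map_phase` only looks at its own chunk, the work can be spread over as many machines as there are chunks, which is the property that makes the model scale.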

To deal with the velocity and veracity of information, Apache Spark comes to our rescue. The Spark project reports speeds up to 100 times faster than MapReduce for workloads that fit in memory. Spark provides in-memory data processing and a built-in machine learning library, MLlib. In-memory processing lets it avoid repeated reads from and writes to disk and so achieve greater speed. The built-in library comprises several machine learning algorithms that researchers can readily apply to complex structured and unstructured data.
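
Why does keeping data in memory matter so much? Many machine learning algorithms pass over the same dataset again and again. The toy model below counts how often the data has to be fetched from storage in each approach. It is only a sketch under stated assumptions: the `DiskDataset` class, the update rule and the numbers are hypothetical, and it does not use or imitate Spark's actual API.

```python
class DiskDataset:
    """Stand-in for data kept on disk; counts every full read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)

def refine(estimate, batch):
    """One iteration: move the estimate halfway toward the batch mean."""
    return estimate + 0.5 * (sum(batch) / len(batch) - estimate)

def run_from_disk(dataset, passes):
    """Reload the data on every pass, as a disk-bound job would."""
    estimate = 0.0
    for _ in range(passes):
        estimate = refine(estimate, dataset.load())
    return estimate

def run_in_memory(dataset, passes):
    """Load the data once, then reuse the cached copy on every pass."""
    cached = dataset.load()
    estimate = 0.0
    for _ in range(passes):
        estimate = refine(estimate, cached)
    return estimate

disk_data = DiskDataset([1.0, 2.0, 3.0, 4.0])
disk_result = run_from_disk(disk_data, passes=10)

memory_data = DiskDataset([1.0, 2.0, 3.0, 4.0])
memory_result = run_in_memory(memory_data, passes=10)
```

Both runs arrive at the same estimate, but `disk_data.reads` grows with the number of passes while `memory_data.reads` stays at one. On real hardware each avoided read is time saved, which is where the speed advantage for iterative workloads comes from.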

A neural way of discovering


So far we have seen how big data can be stored and processed faster. But an important question remains: are the insights generated from the analysis useful or not? How do we identify valuable information in zettabytes of data? To meet this challenge, many machine learning and visualization techniques are being developed. One of the hottest trends in machine learning is deep learning, which is built on multi-layer artificial neural networks. It combines simple features into more complex features layer by layer to extract high-level abstract representations of the data.




Figure 3 Deep Learning (Toy, 2016)
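
The layer-by-layer idea can be shown with a very small network. In the sketch below, the first layer detects two simple features of its inputs ("at least one is on" and "both are on") and the second layer combines them into a more complex one ("exactly one is on", also known as XOR). The weights are hand-picked for illustration, which is an assumption of this example: a real deep learning system has many more layers and learns its weights from data.

```python
def step(z):
    """Threshold activation: the unit fires (1) when its input is positive."""
    return 1 if z > 0 else 0

def layer(inputs, weights, biases):
    """Compute one layer: each row of weights defines one unit."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hidden layer (hand-picked weights): unit 0 acts as OR, unit 1 acts as AND.
HIDDEN_WEIGHTS = [[1, 1], [1, 1]]
HIDDEN_BIASES = [-0.5, -1.5]

# Output layer: fires when OR is on but AND is off, which is XOR.
OUTPUT_WEIGHTS = [[1, -1]]
OUTPUT_BIASES = [-0.5]

def xor_network(x1, x2):
    """Feed two binary inputs through both layers and return the output."""
    hidden = layer([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]

results = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No single unit of this kind can compute XOR on its own, so the example also shows why stacking layers adds expressive power rather than just size.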

Notably, a newly built computer model has detected genetic determinants of autism, colon cancer and spinal muscular atrophy in large regions of the genome where they previously could not be identified. Using DNA sequences from five autistic patients, it identified 39 new genes linked to autism spectrum disorder, which is claimed to be a 40 percent increase over the roughly 100 autism genes previously known. Brendan Frey, a CIFAR senior fellow of the University of Toronto, says “My participation in the Neural Computation & Adaptive Perception program enabled my group to have access to the best techniques in deep learning.” (Cifar.ca, 2016)

Open source is the way to go


The techniques mentioned above are not the only ones for tackling these challenges. Companies and universities alike are developing free, open source solutions to broaden the possibilities of Data Science and help the technology advance in a timely manner. There are many other tools, including open source projects such as Google’s TensorFlow, scikit-learn and H2O, and hosted services such as Amazon Machine Learning.
The journey of data science has been thrilling and the road ahead looks even more exciting. We will be back with more interesting content on data science so stay tuned!

References

The Economist. (2010). Data, data everywhere. [online] Available at: http://www.economist.com/node/15557443 [Accessed 6 Jun. 2016].

Ieeexplore.ieee.org. (2016). IEEE Xplore Full-Text PDF:. [online] Available at: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7067026&tag=1 [Accessed 6 Jun. 2016].

Guo, P. (2016). Data Science Workflow: Overview and Challenges. [online] Cacm.acm.org. Available at: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext [Accessed 6 Jun. 2016].

Rosenberg, D. (2009). Hadoop breaks data-sorting world records. [online] CNET. Available at: http://www.cnet.com/au/news/hadoop-breaks-data-sorting-world-records/ [Accessed 6 Jun. 2016].

Cifar.ca. (2016). Deep learning finds autism, cancer mutations in unexplored regions of the genome : CIFAR. [online] Available at: https://www.cifar.ca/assets/deep-learning-finds-autism-cancer-mutations-in-unexplored-regions-of-the-genome/ [Accessed 6 Jun. 2016].

Toy, J. (2016). opening up deep learning for everyone. [online] Jtoy.net. Available at: http://www.jtoy.net/2016/02/14/opening-up-deep-learning-for-everyone.html [Accessed 6 Jun. 2016].

Anon, (2016). [online] Available at: http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf [Accessed 6 Jun. 2016].

Sweetlysocial.net. (2014). Big Data, Better Marketing | Sweetly Social. [online] Available at: http://sweetlysocial.net/big-data-better-marketing/ [Accessed 6 Jun. 2016].

Kdnuggets.com. (2016). 5 Best Machine Learning APIs for Data Science. [online] Available at: http://www.kdnuggets.com/2015/11/machine-learning-apis-data-science.html [Accessed 6 Jun. 2016].