So far we have covered what data science is, how it helps scientific discovery, and who the active players are. Now it is important to understand the latest tools available for reaching diverse data science goals, the limitations these tools currently face, and the better solutions that are becoming available.

Data science attempts to create value from information by passing it through four principal stages: data preparation, data analysis, reflection and dissemination. This process has opened the doors to faster discoveries but, as the saying goes, “nothing comes easy in life.”

Figure 1 Data Science Workflow (Guo, 2016)

With the world becoming more connected every day, information is growing vast and complex. Alex Szalay, an astrophysicist at Johns Hopkins University, says “How to make sense of all these data? People should be worried about how we train the next generation, not just of scientists, but people in government and industry” (The Economist, 2010). Today’s information demands that companies and research institutes come up with cost-effective, cutting-edge technologies for better insight and decision making.

Let’s look at a few of the challenges that information brings to data science, and a few of the techniques researchers use to address them. These challenges are commonly described in terms of the 5 V’s: Volume, Variety, Velocity, Veracity and Value.

Figure 2 The 5V's of Big Data (Sweetlysocial, 2016)

Cluster processing to the rescue

To start with, the ever-expanding volume and variety of information demand solid infrastructure in which large-scale datasets can be stored and analyzed. Hadoop is one of the tools emerging as an efficient framework for storing and processing big data. It is an open source platform that implements the MapReduce programming model, first described by Google, in which a large job is split into small map and reduce tasks that run in parallel across a cluster.
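To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop or its API; the function names (map_phase, shuffle_phase, reduce_phase) and the sample input are made up for illustration. On a real cluster, each input split would be mapped on a different node and the grouped keys would be reduced in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


# Two hypothetical input splits standing in for chunks of a large file.
splits = ["big data needs big storage", "data science needs data"]
counts = word_count(splits)
print(counts)
```

Because each map call only looks at its own split and each reduce call only looks at one key, both phases can be spread over many machines without the tasks needing to talk to each other.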

Hadoop also provides the Hadoop Distributed File System (HDFS), which allows parallel processing by spreading the data over different nodes. In May 2009, Hadoop set world records by sorting a petabyte of data in 16.25 hours and 1 TB of data in 62 seconds (Rosenberg, 2009). Hadoop has enormous potential for medical discovery and is being used by several large genomics and medical projects. However, a bigger challenge than storing data is processing it in a timely manner. Hadoop is constrained by its disk I/O and by the advanced programming skills it requires of developers.
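The idea of spreading data over different nodes can be sketched roughly as follows. This is a simplified illustration, not the actual HDFS placement policy (which also takes racks and node load into account). The function place_blocks, the node names, and the round-robin rule are assumptions made for this example; the 128 MB block size and three replicas are chosen to resemble common HDFS defaults.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct nodes in round-robin order (illustrative only)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Consecutive offsets modulo len(nodes) guarantee distinct nodes.
        placement[block] = [
            nodes[(block + replica) % len(nodes)]
            for replica in range(replication)
        ]
    return placement


MB = 1024 * 1024
# Hypothetical four-node cluster storing a 500 MB file.
layout = place_blocks(
    file_size=500 * MB,
    block_size=128 * MB,
    nodes=["node1", "node2", "node3", "node4"],
)
for block, holders in layout.items():
    print(block, holders)
```

Since every block lives on several nodes, a job can read different blocks of the same file in parallel, and the loss of a single node does not lose any data.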

To deal with the velocity and veracity of information, Apache Spark comes to our rescue. The Spark project claims that it can run some workloads up to 100 times faster than Hadoop MapReduce when the data fits in memory. In addition to in-memory data processing, Spark provides a built-in machine learning library (MLlib). In-memory processing lets Spark avoid repeated disk reads and writes between steps and so achieve greater speed. The built-in library offers a range of machine learning algorithms that researchers can apply with relative ease to complex structured and unstructured data.
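The benefit of keeping data in memory shows up most clearly in iterative algorithms, which pass over the same dataset many times. The plain-Python sketch below does not use Spark; the DiskDataset class, the iterative_mean function and the sample file are hypothetical and exist only to count how often the disk is touched with and without caching.

```python
import os
import tempfile


class DiskDataset:
    """Reads numeric values from a file and counts each disk read."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1
        with open(self.path) as handle:
            return [float(line) for line in handle]


def iterative_mean(dataset, iterations, cache=False):
    """Estimate the mean by gradient descent, one full pass per iteration.

    With cache=True the data is loaded once and reused from memory;
    otherwise every iteration goes back to disk.
    """
    cached = dataset.load() if cache else None
    estimate = 0.0
    for _ in range(iterations):
        values = cached if cache else dataset.load()
        gradient = sum(estimate - v for v in values) / len(values)
        estimate -= 0.5 * gradient
    return estimate


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.txt")
    with open(path, "w") as handle:
        handle.write("\n".join(str(v) for v in range(1, 101)))

    from_disk = DiskDataset(path)
    in_memory = DiskDataset(path)
    result_disk = iterative_mean(from_disk, iterations=20)
    result_memory = iterative_mean(in_memory, iterations=20, cache=True)

print(from_disk.disk_reads, in_memory.disk_reads)
```

Both runs compute the same estimate, but the cached run reads the file once instead of once per iteration. Many machine learning algorithms have this iterative shape, which is one reason in-memory processing and a machine learning library fit well together.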

A neural way of discovering

So far we have seen how big data can be stored and processed faster. But an important question remains: are the insights generated from analyzing the data actually useful? How do we identify valuable information in zettabytes of data? To meet this challenge, many machine learning and visualization techniques are being developed. One of the hottest trends in machine learning is deep learning, which is built on artificial neural networks with many layers. It combines simple features into more complex ones, layer by layer, to extract high-level abstract representations of the data.

Figure 3 Deep Learning (Toy, 2016)
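The layer-by-layer idea can be sketched as a forward pass through a tiny neural network in plain Python. This is only an illustration: the layer sizes, the sigmoid activation and the random, untrained weights are assumptions made for the example, and a real deep learning model would learn its weights from data using a dedicated library.

```python
import math
import random


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs,
    plus a bias, passed through a nonlinearity."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_inputs, n_outputs, rng):
    """Create untrained weights and zero biases for one layer."""
    weights = [
        [rng.uniform(-1, 1) for _ in range(n_inputs)]
        for _ in range(n_outputs)
    ]
    return weights, [0.0] * n_outputs


def forward(inputs, layers):
    """Feed the inputs through every layer, keeping each layer's output."""
    activations = [inputs]
    for weights, biases in layers:
        activations.append(dense_layer(activations[-1], weights, biases))
    return activations


rng = random.Random(0)
layer_sizes = [4, 3, 2, 1]  # 4 raw features narrowed to 1 abstract output
layers = [
    random_layer(n_in, n_out, rng)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
activations = forward([0.2, 0.7, 0.1, 0.9], layers)
for depth, values in enumerate(activations):
    print(depth, values)
```

Each layer sees only the outputs of the layer before it, so later layers work with combinations of combinations of the raw inputs. That stacking is what allows a trained network to represent increasingly abstract features.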

Notably, a newly built computer model has detected genetic determinants of autism, colon cancer and spinal muscular atrophy in large regions of the genome where they previously could not be identified. Using DNA sequences from five autistic patients, it identified 39 new genes associated with autism spectrum disorder. This is claimed to be a 40 percent increase over the roughly 100 previously known autism genes. Brendan Frey, a CIFAR senior fellow at the University of Toronto, says “My participation in the Neural Computation & Adaptive Perception program enabled my group to have access to the best techniques in deep learning” (Cifar.ca, 2016).

Open source is the way to go

The techniques mentioned above are not the only ones for tackling these challenges. Companies and universities alike are developing free/libre open source solutions to broaden the possibilities of data science and help the technology advance quickly. Many other machine learning tools are available, including Google’s TensorFlow, scikit-learn and H2O, which are open source, and hosted services such as Amazon Machine Learning.

The journey of data science has been thrilling, and the road ahead looks even more exciting. We will be back with more interesting content on data science, so stay tuned!

References

The Economist. (2010). Data, data everywhere. [online] Available at: http://www.economist.com/node/15557443 [Accessed 6 Jun. 2016].

Ieeexplore.ieee.org. (2016). IEEE Xplore Full-Text PDF:. [online] Available at: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7067026&tag=1 [Accessed 6 Jun. 2016].

Guo, P. (2016). Data Science Workflow: Overview and Challenges. [online] Cacm.acm.org. Available at: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext [Accessed 6 Jun. 2016].

Rosenberg, D. (2009). Hadoop breaks data-sorting world records. [online] CNET. Available at: http://www.cnet.com/au/news/hadoop-breaks-data-sorting-world-records/ [Accessed 6 Jun. 2016].

Cifar.ca. (2016). Deep learning finds autism, cancer mutations in unexplored regions of the genome : CIFAR. [online] Available at: https://www.cifar.ca/assets/deep-learning-finds-autism-cancer-mutations-in-unexplored-regions-of-the-genome/ [Accessed 6 Jun. 2016].

Toy, J. (2016). opening up deep learning for everyone. [online] Jtoy.net. Available at: http://www.jtoy.net/2016/02/14/opening-up-deep-learning-for-everyone.html [Accessed 6 Jun. 2016].

Google.com.au. (2016). Redirect Notice. [online] Available at: https://www.google.com.au/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=&url=http%3A%2F%2Fsweetlysocial.net%2Fbig-data-better-marketing%2F&psig=AFQjCNFz3aw0oHgWUtn7rISU68AsVqSchw&ust=1465280114674811 [Accessed 6 Jun. 2016].

Anon, (2016). [online] Available at: http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf [Accessed 6 Jun. 2016].

Sweetlysocial.net. (2014). Big Data, Better Marketing | Sweetly Social. [online] Available at: http://sweetlysocial.net/big-data-better-marketing/ [Accessed 6 Jun. 2016].

Kdnuggets.com. (2016). 5 Best Machine Learning APIs for Data Science. [online] Available at: http://www.kdnuggets.com/2015/11/machine-learning-apis-data-science.html [Accessed 6 Jun. 2016].

DISCOVERING SCIENCE THROUGH DATA SCIENCE

Monday, 6 June 2016

Pushing The Boundaries Of Discoveries With Data Science
