So far we have covered what data science is, how it helps scientific discovery and who the active players are. Now it is important to understand the latest tools available for reaching diverse data science goals, the limitations these tools currently face and the better solutions that are emerging.
The field of data science attempts to create value from information by passing it through four principal stages: data preparation, data analysis, reflection and dissemination. This process has opened the doors to faster discoveries but, as the saying goes, “nothing comes easy in life.”
Figure 1: Data Science Workflow (Guo, 2016)
With the world becoming more connected every day, information is growing vast and complex. Alex Szalay, an astrophysicist at Johns Hopkins University, says “How to make sense of all these data? People should be worried about how we train the next generation, not just of scientists, but people in government and industry” (The Economist, 2010). Today's information demands that companies and research institutes develop cost-effective, cutting-edge technologies for better insight and decision making.
Let’s look at a few of the challenges that information brings to the field of data science, and a few of the techniques researchers use to address them. These challenges are commonly described in terms of the five V’s: Volume, Variety, Velocity, Veracity and Value.
Figure 2: The 5 V's of Big Data (Sweetlysocial, 2016)
Cluster processing to the rescue
To start with, the ever-expanding volume and variety of information demand a solid infrastructure in which large-scale datasets can be stored and analyzed. Hadoop has emerged as an efficient framework for storing and processing big data. It is an open source platform that implements the MapReduce programming model, first described by Google, which splits a large job into small map and reduce tasks that run in parallel across a cluster.
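The map and reduce idea can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and each string in the input list stands in for a chunk of data that a real cluster would hand to a separate worker.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for an input split handled by its own mapper.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data needs big tools", "data science needs data"]
    print(word_count(splits))
```

Because every map call looks at only one split and every reduce call looks at only one key, both phases can be spread over many machines, which is what makes the model attractive for very large datasets.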
In addition, Hadoop provides the Hadoop Distributed File System (HDFS), which enables parallel processing by spreading the data over different nodes. In May 2009, Hadoop set world records by sorting a petabyte of data in 16.25 hours and a terabyte of data in 62 seconds (Rosenberg, 2009). Hadoop has considerable potential for medical discovery and is already used by several large genomics and medical projects. However, a bigger challenge than storing data is processing it in a timely manner. Hadoop MapReduce is constrained by disk I/O, because intermediate results are written to disk between stages, and it demands advanced programming skills from developers.
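To make the idea of spreading data over different nodes more concrete, here is a rough sketch of how a file might be cut into blocks and copied onto several nodes. It assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults but are configurable. The function `place_blocks` and the node names are hypothetical, and the round-robin placement is a simplification: real HDFS placement also takes racks into account.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (a common HDFS default, configurable)
REPLICATION = 3      # assumed replication factor (a common default, configurable)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to distinct nodes.

    Simplified round-robin placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block_id in range(num_blocks):
        # Pick `replication` consecutive nodes, wrapping around the cluster.
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]
    for block, replicas in place_blocks(300, cluster).items():
        print(block, replicas)
```

Since each block lives on several nodes, different workers can read different blocks at the same time, and the loss of a single node does not lose the data.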
To deal with the velocity and veracity of information, Apache Spark comes to the rescue. According to the Spark project, it can run some workloads up to 100 times faster than Hadoop MapReduce when the data fits in memory. In addition to in-memory data processing, Spark provides a built-in machine learning library, MLlib. In-memory processing lets Spark avoid repeated disk reads and writes between steps and so achieve greater speed. The built-in library bundles several machine learning algorithms that researchers can readily apply to complex structured and unstructured data.
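Why keeping data in memory helps can be shown with a toy comparison. The sketch below is plain Python, not Spark code: `DiskDataset` is a hypothetical stand-in for data stored on disk that simply counts how often it is read in full. In Spark itself the comparable step would be caching a dataset (for example with `cache()`) before running an iterative algorithm over it.

```python
class DiskDataset:
    """Hypothetical stand-in for a dataset on disk; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.disk_reads = 0

    def read(self):
        self.disk_reads += 1  # every call simulates one full disk scan
        return list(self._records)


def iterate_from_disk(dataset, iterations):
    """Disk-based style: every iteration re-reads the input from disk."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.read())
    return total


def iterate_in_memory(dataset, iterations):
    """In-memory style: read once, keep the data cached, then iterate."""
    cached = dataset.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


if __name__ == "__main__":
    on_disk = DiskDataset(range(5))
    in_memory = DiskDataset(range(5))
    print(iterate_from_disk(on_disk, 10), on_disk.disk_reads)
    print(iterate_in_memory(in_memory, 10), in_memory.disk_reads)
```

Both functions compute the same result, but the second touches the disk only once. Iterative workloads such as machine learning training, which pass over the same data many times, are where this difference matters most.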
A neural way of discovering
So far we have seen how big data can be stored and processed faster. An important question remains, however: are the insights generated from the analysis actually useful? How do we identify valuable information in zettabytes of data? To meet this challenge, many machine learning and visualization techniques are being developed. One of the hottest trends in machine learning is deep learning, which is built on artificial neural networks with many layers. It combines simple features into more complex ones, layer by layer, to extract high-level abstract representations of the data.
Figure 3: Deep Learning (Toy, 2016)
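The layer-by-layer idea can be sketched as a tiny feed-forward network in plain Python. This is only an illustration of how the output of one layer becomes the input of the next: the weights are hand-picked and hypothetical, the helper names are made up, and no training is shown. Real deep learning models learn their weights from data and usually rely on a dedicated library.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One layer: each unit is a weighted sum of the inputs plus a bias,
    passed through a nonlinearity."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn; the output of one
    layer is the input of the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical, hand-picked weights: 3 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 0.0, 1.0], layers)
print(output)
```

The hidden layer turns the three raw inputs into two intermediate features, and the output layer combines those into a single value. Stacking more such layers is what lets a deep network build increasingly abstract representations.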
Notably, a newly built computer model has detected genetic determinants of autism, colon cancer and spinal muscular atrophy in large regions of the genome where they previously could not be identified. It used DNA sequences from five autistic patients and identified 39 new genes linked to autism spectrum disorder, which is claimed to be a 40 percent increase over the roughly 100 previously known autism genes. Brendan Frey, a CIFAR senior fellow at the University of Toronto, says “My participation in the Neural Computation & Adaptive Perception program enabled my group to have access to the best techniques in deep learning” (Cifar.ca, 2016).
Open source is the way to go
The techniques mentioned above are not the only ones for tackling these challenges. Companies and universities alike are developing free/libre open source solutions to broaden the possibilities of data science and help technology advance in a timely manner. There are many other open source projects, such as Google’s TensorFlow, scikit-learn and H2O, alongside hosted services such as Amazon Machine Learning.
The journey of data science has been thrilling and the road ahead looks even more exciting. We will be back with more interesting content on data science, so stay tuned!
References
The Economist. (2010). Data, data everywhere. [online] Available at: http://www.economist.com/node/15557443 [Accessed 6 Jun. 2016].
Ieeexplore.ieee.org. (2016). IEEE Xplore Full-Text PDF. [online] Available at: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7067026&tag=1 [Accessed 6 Jun. 2016].
Guo, P. (2016). Data Science Workflow: Overview and Challenges. [online] Cacm.acm.org. Available at: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext [Accessed 6 Jun. 2016].
Rosenberg, D. (2009). Hadoop breaks data-sorting world records. [online] CNET. Available at: http://www.cnet.com/au/news/hadoop-breaks-data-sorting-world-records/ [Accessed 6 Jun. 2016].
Cifar.ca. (2016). Deep learning finds autism, cancer mutations in unexplored regions of the genome : CIFAR. [online] Available at: https://www.cifar.ca/assets/deep-learning-finds-autism-cancer-mutations-in-unexplored-regions-of-the-genome/ [Accessed 6 Jun. 2016].
Toy, J. (2016). opening up deep learning for everyone. [online] Jtoy.net. Available at: http://www.jtoy.net/2016/02/14/opening-up-deep-learning-for-everyone.html [Accessed 6 Jun. 2016].
Google.com.au. (2016). Redirect Notice. [online] Available at: https://www.google.com.au/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=&url=http%3A%2F%2Fsweetlysocial.net%2Fbig-data-better-marketing%2F&psig=AFQjCNFz3aw0oHgWUtn7rISU68AsVqSchw&ust=1465280114674811 [Accessed 6 Jun. 2016].
Anon, (2016). [online] Available at: http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf [Accessed 6 Jun. 2016].
Sweetlysocial.net. (2014). Big Data, Better Marketing | Sweetly Social. [online] Available at: http://sweetlysocial.net/big-data-better-marketing/ [Accessed 6 Jun. 2016].
Kdnuggets.com. (2016). 5 Best Machine Learning APIs for Data Science. [online] Available at: http://www.kdnuggets.com/2015/11/machine-learning-apis-data-science.html [Accessed 6 Jun. 2016].