Monday, 30 May 2016

Scientific Discovery is the game, Data Scientists are the players

Data Science (or, more specifically, data mining) has long been used for business-oriented applications such as stock trading, credit scoring, and mail service optimization. These activities require processing large volumes of data to reveal the patterns and tendencies of individuals or single entities (Read, 2010).
With the rise of cheap computational power and fast internet connections, Data Science approaches are ideal for these activities. They provide an integrated framework for processing large amounts of data and learning from it efficiently with little human interaction: once the attributes for the model are set, the machine learns, analyzes, and builds the model, leaving the analyst to evaluate it.

From Business to Scientific Research

A very important characteristic of data mining approaches, however, is their ability to extract human-readable, statistically significant rules from the data, which an expert can evaluate and use for practical purposes (Embcrechts, et al., 2005).
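To make the idea of a human-readable rule concrete, here is a minimal sketch of a one-feature threshold rule (a "decision stump") learned from labeled data. It is not the method described in the cited chapter; the sample values, the labels, and the function names are all hypothetical, and the sketch reports training accuracy only, without any test of statistical significance.

```python
# Hypothetical toy data: (measurement, label). Values are invented for illustration.
SAMPLES = [
    (1.0, "normal"), (1.5, "normal"), (2.0, "normal"), (2.5, "normal"),
    (3.0, "abnormal"), (3.5, "abnormal"), (4.0, "abnormal"), (2.8, "abnormal"),
]


def learn_threshold_rule(samples, low_label):
    """Find the threshold on one numeric feature that classifies the most samples correctly.

    The rule reads: value <= threshold -> low_label, otherwise the other label.
    Returns (threshold, training_accuracy).
    """
    values = sorted({v for v, _ in samples})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_correct = None, -1
    for t in candidates:
        correct = sum(
            1 for v, label in samples
            if (label == low_label) == (v <= t)
        )
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold, best_correct / len(samples)


def describe_rule(threshold, low_label, high_label, feature="measurement"):
    """Render the learned rule as text that a domain expert can read and judge."""
    return f"IF {feature} <= {threshold:.2f} THEN {low_label} ELSE {high_label}"


threshold, accuracy = learn_threshold_rule(SAMPLES, "normal")
rule_text = describe_rule(threshold, "normal", "abnormal")
print(rule_text, f"(training accuracy {accuracy:.0%})")
```

The point of the sketch is the output format: the model is a sentence an expert can accept or reject, rather than an opaque set of weights.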

In science, a large part of research consists of analyzing large amounts of data to support or refute hypotheses and to find discernible patterns. One example is pattern recognition in brain scans or electrocardiograms, used to find problems or detect defects that might give insight into the early detection of abnormalities (Vasileios, et al., 2000).

Steps of the Scientific Method


Figure 1 – Steps of the Scientific Process (Shuttleworth, 2013)


That is where Data Science comes into play, offering the power and flexibility to derive patterns from enormous amounts of data in a fraction of the time.

Bioinformatics: Data science to Save Lives

For a long time, scientists researching potential cures and early detection techniques for diseases have relied on visually identifying patterns in tissue samples, analyzed through microscopes and medical devices in laboratories, a process that could take a lot of time.

But now several enterprises and researchers are shifting from that paradigm to the Data Science approach. One example is the research at the University of California, Santa Cruz (Schatz, 2015). Traditionally, doctors have treated cancerous tumors according to the part of the body in which they appear. UCSC, however, is working on a Cancer Genome Atlas to process and cross-reference tumors, looking for similarities among seemingly different types of cancer in order to improve detection and treatment.

Bioinformatics Jobs
Figure 2 – Bioinformatics Twitter Hashtags (Bioinformatics Jobs, 2016)

We can also mention the pharmaceutical conglomerate Novartis, which made great advances in detecting kidney disease. The company reported that a team formed in collaboration with the Necker Children’s Hospital-Imagine Foundation in Paris used Big Data to discover, in just six weeks, a previously unnoticed gene abnormality that causes focal segmental glomerulosclerosis (Brien, 2013).
Certainly, there is no shortage of Big Data advancements in the field of Bioinformatics, with numerous companies actively developing and using Data Science software for research (KDnuggets, 2000).

Unveiling the Cosmos with Data Science

The sheer vastness of the universe makes its study a prime field for Big Data. The amount of data produced could reach 25 zettabytes a year, roughly doubling each year at a pace comparable to Moore’s law (Stephens, et al., 2015). These volumes have been made possible by advances in telescope construction and detector sensitivity.

Figure 3 – Four Domains of Big Data (Stephens, et al., 2015)

But the problem lies not only in storage but also in processing. That is why projects such as GALEX and the Kepler Space Telescope have enormous data and image processing frameworks, and why the ALMA and Square Kilometer Array projects have even larger data infrastructure planned (Andersen, 2012).

Moving Physics to Hyper Drive

CERN has been prominent in the media with its latest discoveries, which come with the big problem of having extremely large data sets to go through. Thanks to its data processing frameworks and capabilities, it has succeeded in numerous discoveries, including that of the Higgs boson with its Large Hadron Collider. A similar data challenge faces the Daya Bay Reactor Neutrino Experiment, which seeks a better understanding of the neutrino, a subatomic particle produced by decaying radioactive elements (Prabhat, 2015).
These endeavors have been so successful that CERN is planning to expand its Big Data analysis framework by upgrading its detectors and processing units, in the hope of improving its understanding of dark matter (University of Bristol, 2016).

Data Science in the Aid of All Sciences

These are only a few examples of scientific players that have entered the Data Science game, but any field with large amounts of data can take advantage of big data processing and prediction models.
Data Science is even present in scientific research as a whole: a company called Iris AI has developed a machine learning algorithm that lets researchers find relevant publications by entering a text that explains the subject at hand (Frank, 2016).
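As a rough illustration of text-based publication search, the sketch below ranks a few abstracts by bag-of-words cosine similarity to a query text. This is a deliberately simple stand-in and not Iris AI's algorithm, which the source does not describe; the abstracts, titles, and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_publications(query, abstracts):
    """Return (title, score) pairs sorted from most to least similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (title, cosine_similarity(query_vec, Counter(tokenize(text))))
        for title, text in abstracts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical abstracts, invented for illustration.
ABSTRACTS = {
    "Neutrino oscillation measurements": "reactor neutrino oscillation measured with detectors",
    "Tumor genome comparison": "comparing tumor genome sequences across cancer types",
    "Telescope image pipelines": "processing telescope image data from sky surveys",
}

ranking = rank_publications("genome sequences of cancer tumor samples", ABSTRACTS)
for title, score in ranking:
    print(f"{score:.2f}  {title}")
```

A production system would likely weight rare words more heavily and capture meaning beyond exact word overlap, but the ranking step works the same way: score each document against the query and sort.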

Figure 4 – Data Processing approach used at CERN (Jones, 2011)

So it is only a matter of time until other disciplines adopt Data Science as an intrinsic part of the scientific process of research and discovery.

Bibliography

Andersen, R., 2012. How Big Data Is Changing Astronomy (Again). [Online]
Available at: http://www.theatlantic.com/technology/archive/2012/04/how-big-data-is-changing-astronomy-again/255917/
[Accessed May 2016].
Bioinformatics Jobs, 2016. Twitter. [Online]
Available at: https://twitter.com/bioinformaticsj
[Accessed May 2016].
Brien, T. O., 2013. Surfing the wave of big data analytics. [Online]
Available at: https://www.novartis.com/stories/discovery/surfing-wave-big-data-analytics
[Accessed May 2016].
Embcrechts, Szymanski & Sternickel, 2005. Chapter 10: Introduction to Scientific Data Mining. In: Computationally Intelligent Hybrid Systems. New York: s.n., pp. 317-365.
Frank, A., 2016. Machine Learning’s Next Trick Will Transform How Research Is Done. [Online]
Available at: http://singularityhub.com/2016/05/26/machine-learnings-next-trick-will-transform-how-research-is-done/
[Accessed May 2016].
Jones, B., 2011. Massive Computing at CERN and lessons learnt. [Online]
Available at: http://slideplayer.com/slide/6388912/
[Accessed May 2016].
KDnuggets, 2000. Bioinformatics Companies. [Online]
Available at: http://www.kdnuggets.com/companies/bioinformatics.html
[Accessed February 2016].
Prabhat, 2015. Big science problems, big data solutions. [Online]
Available at: https://www.oreilly.com/ideas/big-science-problems-big-data-solutions
[Accessed May 2016].
Read, B., 2010. Data Mining and Science?. [Online]
Available at: http://www.ercim.eu/publication/ws-proceedings/12th-EDRG/EDRG12_Re.pdf
[Accessed May 2016].
Schatz, R. D., 2015. Decoding and Defeating Cancer with Data Science. [Online]
Available at: http://www.slate.com/articles/health_and_science/ucsc2015/2015/04/decoding_and_defeating_cancer_with_data_science.html
[Accessed May 2016].
Shuttleworth, M., 2013. What is Research?. [Online]
Available at: https://explorable.com/what-is-research
[Accessed May 2016].
Stephens, et al., 2015. Big Data: Astronomical or Genomical?. [Online]
Available at: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
[Accessed May 2016].
University of Bristol, 2016. Dark Matter search enhanced by LHC’s new turbocharged ‘Brain’. [Online]
Available at: http://www.bristol.ac.uk/news/2016/may/dark-matter-search.html
[Accessed May 2016].
Vasileios, et al., 2000. Data mining in brain imaging - Abstract. [Online]
Available at: http://smm.sagepub.com/content/9/4/359.abstract
[Accessed May 2016].


Monday, 23 May 2016

Data Science And Scientific Discovery


We live in a world where every day there is a new challenge. As we solve these challenges, we create volumes of information that change the way we perceive the world around us. This systematic method of research, directed at understanding every aspect of our perceivable universe based on evidence, is what we call science.

THE ROLE OF SCIENCE IN HUMAN HISTORY

The history of science spans from ancient times to the present. This period has seen the emergence of numerous scientific revolutionaries, methods, and discoveries. According to Britannica,

“A new view of nature emerged, replacing the Greek view that had dominated science for almost 2,000 years. Science became an autonomous discipline, distinct from both philosophy and technology and came to be regarded as having utilitarian goals”- (Encyclopedia Britannica, 2016)

Over the centuries, the domain of science has expanded exponentially. To a great extent, it has helped us to answer what is happening, why it is happening and what will happen in a wide array of fields such as Astronomy, Biology, Ecology, Genetics, Physics and many more.

So far, scientific research has followed the traditional approach of deductive reasoning. In the deductive process, a hypothesis is created and then experiments are carried out to test its validity. Unfortunately, with this approach, it could be years before sufficient data is gathered from tests to support the claim and back it up with resounding and definitive results.



Figure 1 - Data Intensive Science (Slideshare.net, 2016)


However, this approach has been changing at an exponential pace thanks to advances in knowledge and technology. As evidence of how fast the field is growing, we now create 2.5 quintillion bytes of data every day (Storagenewsletter.com, 2016) and have cheap computational processing power at hand. This new scenario has enabled research techniques that were not possible earlier because of technological limitations. The field that combines these data-oriented techniques is known as “Data Science”.


DATA SCIENCE: A SHIFT OF PARADIGM



Data science is a multidisciplinary field that combines the power of machine learning, artificial intelligence, data mining, statistics, applied mathematics, and visualization. The field also makes it possible to perform both deductive and inductive reasoning. While the former is hypothesis driven, the latter focuses on refining existing hypotheses or generating new ones by spotting interesting patterns in huge, heterogeneous, and unstructured data. This approach is helping the scientific community accelerate the rate of scientific discovery.
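The difference between the two modes of reasoning can be sketched in a few lines. In the deductive step, one hypothesis stated in advance is checked against the data; in the inductive step, every pair of variables is scanned and the strongest pattern is proposed as a new hypothesis. The measurements, variable names, and the 0.8 correlation threshold are hypothetical choices made for this illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements, invented for illustration.
DATA = {
    "dose": [1, 2, 3, 4, 5, 6],
    "response": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],
    "age": [50, 23, 41, 35, 62, 29],
}


def hypothesis_holds(data, a, b, threshold=0.8):
    """Deductive step: check one hypothesis that was stated in advance."""
    return abs(pearson(data[a], data[b])) >= threshold


def strongest_pattern(data):
    """Inductive step: scan every pair of variables and return the strongest correlation."""
    return max(
        ((a, b, pearson(data[a], data[b])) for a, b in combinations(data, 2)),
        key=lambda item: abs(item[2]),
    )


print("dose drives response?", hypothesis_holds(DATA, "dose", "response"))
var_a, var_b, r = strongest_pattern(DATA)
print(f"strongest pattern found: {var_a} vs {var_b} (r = {r:.3f})")
```

A pattern found by scanning is only a candidate: with many variables, some strong correlations appear by chance, so the hypothesis it suggests still has to be tested deductively on fresh data.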



Figure 2 - Timeline of data science (What's The Big Data?, 2015)


DATA SCIENCE IN THE AID OF SCIENTIFIC DISCOVERIES


To give an example, in the field of high-energy particle physics there are instruments such as the Large Hadron Collider (O'Reilly Media, 2015), which are used to smash particles together and examine their constituents. This process produces exabytes of data, which makes its analysis dependent on powerful supercomputers and advanced data science techniques. These techniques have recently led to the discovery of the Higgs boson, which is considered a landmark achievement in the history of particle physics.

Another example worth mentioning comes from the field of genetics, where researchers are also using data science techniques to understand the relationship between complex diseases and genetic effects (Feero, Guttmacher and Manolio, 2010). So far they have been able to identify connections between 2000 genes and 300 common human disease traits.




Figure 3 - Large Hadron Collider (Apod.nasa.gov, 2016) and Genome (IFLScience, 2015)

WHERE DATA SCIENCE IS GOING IS YET TO BE TOLD

The potential of data science is vast and inspiring. Watch this space for more information on data science. During the coming weeks, a deeper study will be performed, starting by providing details on the active users of data science for scientific discovery, the challenges they face and the ways they solve the problems.





Tags: data science, big data, scientific discovery, research, data processing, data analytics, scientific process, inductive process.
Bibliography
Encyclopedia Britannica. (2016). physical science | Definition, History, & Topics. [online] Available at: http://www.britannica.com/science/physical-science [Accessed 22 May 2016].

Storagenewsletter.com. (2016). StorageNewsletter » Every Day We Create 2.5 Quintillion Bytes of Data. [online] Available at: http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/ [Accessed 22 May 2016].

O'Reilly Media. (2015). Big science problems, big data solutions. [online] Available at: https://www.oreilly.com/ideas/big-science-problems-big-data-solutions [Accessed 22 May 2016].

Feero, W., Guttmacher, A. and Manolio, T. (2010). Genomewide Association Studies and Assessment of the Risk of Disease. New England Journal of Medicine, 363(2), pp.166-176.

Anon, (2016). [online] Available at: http://renci.org/wp-content/uploads/2015/11/SCi-Discovery-BigData-FINAL-11.23.15.pdf [Accessed 22 May 2016].

Anon, (2016). [online] Available at: https://www.boozallen.com/content/dam/boozallen/documents/2015/12/2015-FIeld-Guide-To-Data-Science.pdf [Accessed 23 May 2016].

What's The Big Data?. (2015). History of Data Science (Infographic). [online] Available at: https://whatsthebigdata.com/2015/02/17/history-of-data-science-infographic/ [Accessed 23 May 2016].

Apod.nasa.gov. (2016). APOD: 2011 December 18 - Hints of Higgs from the Large Hadron Collider. [online] Available at: http://apod.nasa.gov/apod/ap111218.html [Accessed 23 May 2016].

IFLScience. (2015). Entire Human Genome Can Now Be Sequenced For Just $1,000. [online] Available at: http://www.iflscience.com/health-and-medicine/entire-human-genome-can-now-be-read-1000 [Accessed 23 May 2016].

Slideshare.net. (2016). The fourth paradigm: data intensive scientific discovery - Jisc Digif…. [online] Available at: http://www.slideshare.net/JISC/the-fourth-paradigm-data-intensive-scientific-discovery-jisc-digifest-2016/4 [Accessed 23 May 2016].