Big Data in Data Science

Data science is an interdisciplinary field that applies systematic methods to collect, store, and analyze data. Its aim is to extract insights and knowledge from any type of data – both structured and unstructured.

What is Big Data?

Big Data describes the large volume of structured, semi-structured, and unstructured data. Big data is commonly characterized by three Vs: the volume of data, the velocity at which the data is generated and collected, and the variety of the information.

Source: https://www.coursera.org/learn/data-scientiststools/ungradedWidget/8x1jm/bigdata

Getting Started with Big Data:
Getting started with big data requires three key steps:

  • Integrate
  • Manage
  • Analyze

The definition of “Big” has evolved with advances in technology and data storage that allow larger data sets to be held. Our capacity to collect and store data has also improved over time, so that the speed of data collection is now unprecedented. The idea of “data” itself has broadened: the internet and new technologies make it possible to collect many different categories of data for analysis. One of the main shifts in data science has been moving from well-structured data sets to tackling unstructured data.

Structured & Unstructured Data:

Structured data matches our usual idea of data: long tables, spreadsheets, or databases with rows and columns of information that one can sum, average, or otherwise analyze. In reality, we often encounter data sets that are much messier, and the job of the data scientist is to wrangle them into a neat and tidy format. With the advancement of the internet and technology, many pieces of information that weren’t traditionally collected can suddenly be translated into a format that a computer can record, store, search, and analyze. At present, unstructured data is being collected from all of our digital interactions: emails, Facebook, Instagram, YouTube, Twitter, SMS, shopping habits, smartphone use, CCTV cameras and other video sources, and so on.
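As a small illustration of this tidying step (the record format, field names, and values below are entirely hypothetical), free-text records can be parsed into a structured table that supports the usual sums and averages:

    import re
    import pandas as pd

    # Hypothetical raw, unstructured records (e.g., lines scraped from a log or feed)
    raw_records = [
        "2020-04-01 | user=alice | purchase=12.50",
        "2020-04-01 | user=bob   | purchase=7.25",
        "2020-04-02 | user=alice | purchase=3.00",
    ]

    rows = []
    for line in raw_records:
        # Extract the date, user name, and purchase amount from each messy line
        match = re.search(r"(\d{4}-\d{2}-\d{2}) \| user=(\w+)\s*\| purchase=([\d.]+)", line)
        if match:
            date, user, amount = match.groups()
            rows.append({"date": date, "user": user, "amount": float(amount)})

    # A tidy, structured table: one row per observation, one column per variable
    df = pd.DataFrame(rows)
    print(df.groupby("user")["amount"].sum())  # now easy to sum or average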

Is Big Data a Volume or Technology?

The term “Big Data” may seem to suggest a large data set, but when used by vendors it often refers to the technology. That technology incorporates the tools and processes needed to handle the massive volume of data and the facilities to store it.

Advantages and Disadvantages of Big Data:

The challenges of working with big data are:

  • Big: A massive volume of raw data that you need to be able to store and analyze.
  • Variety: With so many different sources of information, it can be difficult to decide which sources of data to use.
  • Messy: In reality, the collected data can be messy, and you need to turn the unstructured data into a format that can be analyzed.
  • Update: The technology is changing at a rapid pace. Apache Hadoop and later Apache Spark were first used to solve big data problems; now hybrid frameworks that combine such tools are often considered the best approach.

The advantages of working with big data are:

  • Tolerate small errors: There are many sources of error in data collection, but the sheer volume of a big data set can offset small errors in it.
  • Accurate decisions: Up-to-date information allows you to analyze the current state of the system and make rapid, informed predictions and decisions.
  • Answer previously unfeasible questions: Unconventional data sources may allow you to answer questions that were previously inaccessible or unfeasible, and big data can enable you to obtain more complete answers than before.
  • Identify hidden correlations: Big data can reveal hidden relationships between the outcome variable and input variables that would not otherwise appear to be related.

Applications of Big Data:

Big Data can help companies address a range of business activities, from customer experience to analytics. A few examples are:

  • Product Development
  • Customer Experience
  • Fraud and Compliance
  • Machine Learning
  • Operational Efficiency

Conclusion:

A famous statistician, John Tukey, said in 1986, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” Likewise, a big data set may not answer every question if it is not the right data. We can conclude that data science is a question-driven science, and even the largest data sets may not always be suitable.

Forecasting of Covid-19 using mathematical modeling & machine learning

The World Health Organization has declared Covid-19 a pandemic. As of April 29, 2020, the Covid-19 outbreak had produced more than 3 million confirmed cases and more than 200,000 deaths worldwide. Various mathematical and statistical models have been developed to describe the transmission dynamics, but forecasting future cases in real time may not be accurate. We need to develop artificial intelligence based (machine learning and deep learning) mathematical and statistical models to overcome the constraints of the classical epidemiological modeling approach.

Model Formulation:

Researchers follow a few basic steps to construct any model. The steps are:

  1. Define: First, we need to define the problem. Model selection depends on how much data is available, and it also involves the difficulty of choosing a suitable model from a large set of candidate computational models for decision making.
  2. Fit: The most essential part of modeling. This step captures the trajectory of the system using real-time data.
  3. Predict: Predictive modeling uses machine learning techniques to predict an outcome or event.
  4. Evaluate: Accuracy, precision, and recall are three common metrics for evaluating a model; a minimal sketch of the fit-predict-evaluate workflow follows this list.
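The sketch below is only a toy illustration of steps 2–4, not the model used in the study; the synthetic features, labels, and the choice of logistic regression are assumptions made purely for the example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Hypothetical data: two illustrative features and a binary label
    # (1 = case counts rose the following week). Purely synthetic.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    # Fit: learn the model parameters from training data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)

    # Predict: apply the fitted model to unseen data
    y_pred = model.predict(X_test)

    # Evaluate: accuracy, precision, and recall, as noted above
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))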

Over the last few weeks, we have seen various graphs and charts that project Covid-19 cases, but many of these models were developed using data from previous outbreaks such as SARS or MERS. Recently, researchers from MIT developed a neural network that uses Covid-19 data to estimate the effective reproduction number, one of the most important metrics in epidemiology. It can be defined as “the average number of secondary cases per infectious case in a population made up of both susceptible and non-susceptible hosts.” An effective reproduction number greater than one means the epidemic continues to spread exponentially, while a value less than one indicates the point at which we can flatten the curve and observe fewer infections.
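As a rough back-of-the-envelope illustration (not the neural-network estimator from the MIT study), the effective reproduction number can be approximated by comparing new cases one generation interval apart; the case counts and serial interval below are hypothetical.

    import numpy as np

    # Hypothetical daily new-case counts and an assumed serial interval of ~5 days
    new_cases = np.array([100, 120, 150, 170, 190, 200, 205, 200, 190, 175, 160, 150])
    serial_interval = 5  # assumed days between successive generations of infection

    # Crude estimate: R_t ≈ cases in the current generation / cases one generation earlier
    r_t = new_cases[serial_interval:] / new_cases[:-serial_interval]
    print(np.round(r_t, 2))  # values drifting below 1 suggest the curve is flattening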

Classical epidemiological models predict the growth of disease by grouping the total population into susceptible (S), exposed (E), infected (I), and recovered (R) compartments. Recently, the SEIR model has been extended by training a neural network to capture the transmission dynamics of Covid-19 [1]. The investigation relies on Covid-19 surveillance data. The machine learning algorithm estimates the “infection plateau” for the United States as somewhere between the second and third week of April 2020 (Figure 1). This projection of infected cases is similar to other predictions, such as that of the Institute for Health Metrics and Evaluation. The MIT model also suggests that the quarantine and lockdown policies have been successful in bringing the effective reproduction number below one.
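For reference, here is a minimal sketch of the classical SEIR compartmental equations that such work builds on, not the neural-network-augmented model of [1]; the parameter values and population size are illustrative assumptions only.

    import numpy as np
    from scipy.integrate import odeint

    def seir(state, t, beta, sigma, gamma, N):
        # Classical SEIR equations: S -> E -> I -> R
        S, E, I, R = state
        dS = -beta * S * I / N              # new infections leave the susceptible pool
        dE = beta * S * I / N - sigma * E   # exposed become infectious at rate sigma
        dI = sigma * E - gamma * I          # infectious recover at rate gamma
        dR = gamma * I
        return dS, dE, dI, dR

    # Illustrative (assumed) parameters: contact rate, incubation rate, recovery rate
    N = 1_000_000
    beta, sigma, gamma = 0.5, 1 / 5.2, 1 / 10
    t = np.linspace(0, 180, 181)      # days
    initial = (N - 10, 0, 10, 0)      # start with 10 infectious individuals
    S, E, I, R = odeint(seir, initial, t, args=(beta, sigma, gamma, N)).T
    print("peak infections on day", int(t[I.argmax()]))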

For the healthcare system this looks like promising news in terms of the number of confirmed cases, but it should not be taken as a signal to start relaxing control measures. Early termination of the lockdown could trigger a catastrophic second wave with a sharper and more rapid secondary peak.

Figure 1: Model forecast of confirmed infected cases in the United States (Source: Reference [1]).

Reference:

  1. Dandekar, Raj, and George Barbastathis. “Quantifying the effect of quarantine control in Covid-19 infectious spread using machine learning.” medRxiv (2020).