Learning the basic techniques and concepts of BigData is useful before learning any BigData-related technology such as Hadoop, because it gives you a fair idea of how things fit together. This is a quick introduction to BigData concepts: definitions, applications and differences from small data.
This note can be used as a quick primer or a quick refresher on BigData concepts. For detailed learning, you may refer to the reference notes or tutorials mentioned.
Wikipedia defines BigData as follows: “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate”.
In 2012, Gartner updated its definition for BigData as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Volume, Velocity and Variety are usually called the 3 Vs of BigData. They were introduced by META Group (now Gartner) analyst Doug Laney in a 2001 research report and related lectures. Laney defined data growth challenges and opportunities as being three-dimensional: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).
Therefore, we can say that BigData is data that is very large, that arrives very fast for processing (for example, as a continuous stream), and that may be very diverse, mixing structured data, unstructured data, NoSQL database data and so on.
BigData analytics offers great opportunities. Below are some of the current applications of BigData.
Recommendation engines, provided by most shopping sites like Amazon, that suggest products based on what you have bought or even searched for before.
Siri, a feature of Apple Inc.'s iOS that works as an intelligent personal assistant and knowledge navigator. It uses a natural language user interface to answer questions, make recommendations, and perform actions.
Search suggestions, shown as you start typing, provided by search engines like Google.
Ad targeting, through providers like Google AdWords, that shows you ads based on your previous searches and other online activities.
Predictive marketing, that identifies target audiences or trends based on factors such as consumer behavior and demographic information like age and salary, which may be readily available or may be purchased.
Fraud detection, especially for credit card, debit card and online usage, based on point of sale, geo-location and IP address, login time, and even biometric details like the time taken to make mouse movements.
Google Flu Trends, that uses aggregated Google search data to estimate flu activity.
NASA’s Kepler, a space observatory launched by NASA to discover Earth-like planets orbiting other stars, that continuously transmits data to Earth. The data is then analyzed to detect the periodic dimming caused by extrasolar planets that cross in front of their host star.
The book “Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information” by Jules J. Berman lists 10 ways in which big data is different from small data. Here is a quick summary of them.
Goals
Small data usually has a specific goal or purpose.
BigData may have one goal or purpose in the beginning, but might take unexpected turns later.
Location
Small data usually resides in one file or one place, like a single computer.
BigData may spread across multiple files or computers, and even multiple geographical locations.
Data structure and contents
Small data is usually structured, as in an RDBMS table or an Excel spreadsheet.
BigData may be unstructured and may belong to different file formats.
Data preparation
Small data is usually prepared by the end user for their own use.
BigData may be prepared, analyzed and used by different groups of people.
Longevity (life expectancy)
Small data is usually kept for a specific period of time and may be deleted or archived after that time period.
BigData usually stays for longer periods and new data may be added to the existing data set.
Measurements
Small data is usually measured with the same protocol or unit of measurement.
BigData may be measured with different protocols or units of measurement, and may need conversions to make the units consistent for analysis.
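As a small illustration of the conversion step described above, the sketch below brings temperature readings reported in different units to one common unit before analysis. The record layout, the source names and the to_celsius helper are assumptions made for this example only; they do not come from any particular tool.

```python
# Hypothetical readings from different sources, each reported in its own unit.
readings = [
    {"source": "sensor-a", "value": 25.0, "unit": "C"},
    {"source": "sensor-b", "value": 77.0, "unit": "F"},
    {"source": "sensor-c", "value": 298.15, "unit": "K"},
]

def to_celsius(value, unit):
    """Convert a temperature given in Celsius, Fahrenheit or Kelvin to Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit}")

# Normalize every reading to Celsius, rounded to two decimals for comparison.
normalized = [round(to_celsius(r["value"], r["unit"]), 2) for r in readings]
print(normalized)
```

Once all values share one unit, they can be compared or aggregated directly; an unknown unit raises an error instead of silently mixing scales.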
Reproducibility
Small data can be reproduced in its entirety if something goes wrong in the process, as it usually comes from a single source and is easy to recreate.
BigData may come from various sources and hence may not be reproducible in its entirety.
Stakes
The cost, if something goes wrong with the data set, is limited in the case of small data.
The cost, if something goes wrong with BigData, can be very high, even to the extent of affecting the researcher and the organization.
Introspection
Small data may carry some additional data (e.g. a description element tag) that describes the content, to allow easier introspection.
BigData might not carry description data for all content, and may contain unidentifiable, un-locatable, and meaningless data.
Note that BigData and related processing might not be a candidate for all data scenarios because of these limitations.
Analysis
It is easy to analyze small data, as it usually stays on a single computer.
Since BigData may be spread across many computers, its analysis may involve many tasks such as abstraction, reviewing, reducing and finally aggregating results.
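The reducing and aggregating steps mentioned above can be sketched in a few lines. In this minimal, single-process illustration, each inner list stands in for the data held on one computer; each partition is first reduced to a small partial result, and only those partial results are combined. The data and the function names summarize and combine are hypothetical.

```python
# Hypothetical partitions: each inner list stands for the data held on one computer.
partitions = [
    [12, 7, 3],
    [20, 5],
    [9, 14, 6, 4],
]

def summarize(partition):
    """Reduce one partition to a small partial result: (sum, count)."""
    return sum(partition), len(partition)

def combine(partials):
    """Aggregate the partial results into a single overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partials = [summarize(p) for p in partitions]  # would run where each partition lives
overall_mean = combine(partials)               # runs once, on the small partial results
print(partials, overall_mean)
```

The point of the design is that only the small (sum, count) pairs need to travel between computers, not the raw data.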
Many researchers have added more Vs to the original 3 Vs, saying that three are not enough. Bernard Marr, in an article on LinkedIn, describes 5 Vs: Volume, Velocity, Variety, Veracity and Value. An article on dataconomy.com lists 7 Vs: Volume, Velocity, Variety, Veracity, Variability, Value and Visualization. Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson mentions 10 Vs: Volume, Velocity, Variety, Veracity, Validity, Variability, Value, Venue, Vocabulary, and Vagueness.
Below is a quick description of most of these additional Vs:
Veracity (conformity to facts; accuracy)
Refers to the trustworthiness of the data. Does the data actually contain enough information to draw accurate conclusions?
Validity asks whether the data is clean and well-managed, meeting the requirements of the data processing.
Variability means that data or its meaning can change over time and even place, due to many uncontrolled factors; you need to measure and account for them.
Value asks whether the data has enough value to justify the time and effort you invest in it.
Visualization means presenting the data, once it has been processed, in a manner that is readable and accessible.
Venue asks whether the venue or location affects access to the data.
Vocabulary refers to the metadata that describes the data and is significant especially when combining data from very different sources. Different sources may call the same thing with different names.
Vagueness asks whether the researcher is clear on the goals and purpose of the research; otherwise, a lot of time will be wasted.
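To make the Vocabulary point above concrete, here is a minimal sketch in which two sources name the same fields differently and a shared vocabulary is used to line them up. The field names and the FIELD_ALIASES mapping are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from each source's field names to one shared vocabulary.
FIELD_ALIASES = {
    "cust_id": "customer_id",
    "customerNumber": "customer_id",
    "zip": "postal_code",
    "postcode": "postal_code",
}

def normalize(record):
    """Rename known aliases to the shared names; leave other fields untouched."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}

# The same customer as described by two different sources.
source_a = {"cust_id": 42, "zip": "10001"}
source_b = {"customerNumber": 42, "postcode": "10001"}

print(normalize(source_a))
print(normalize(source_a) == normalize(source_b))
```

In practice such a mapping is part of the metadata that has to be agreed on before data from different sources can be combined.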
This is probably the section everyone has been looking for: how Hadoop is related to BigData.
Hadoop is not a single product, but a collection of applications.
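Hadoop has a core set of components:
HDFS, which stands for Hadoop Distributed File System.
MapReduce, a process that splits a task into pieces (mapping) and then combines the results (reducing).
YARN (Yet Another Resource Negotiator), the cluster resource management and job scheduling layer introduced in Hadoop 2, on which MapReduce and other processing engines run. It is not a newer version of MapReduce, although MapReduce running on YARN is sometimes called MapReduce 2.
There are also additional technologies like Pig, Hive, HBase, Storm, Spark, Shark etc. that work along with the core set of HDFS, MapReduce and YARN and help in processing BigData.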
Definition: Hadoop is a distributed platform for storage and analysis that can help us do BigData analytics in a reliable, scalable and affordable way.
HDFS is responsible for providing reliability (by replicating the data across a cluster of nodes) and scalability (by allowing more nodes to be added to the setup without affecting other nodes).
MapReduce is a programming model, and an implementation of it, that processes and generates large data sets with a parallel, distributed algorithm on a cluster, which makes processing large data sets much faster than on a single machine.
Hadoop is also an affordable solution because it runs on one or many regular commodity machines and is open source.
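The map and reduce steps described above can be illustrated with the classic word count example. The sketch below is plain Python that runs in a single process; it does not use the Hadoop API, and the function names map_phase, shuffle and reduce_phase are hypothetical names chosen for this illustration. On a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input: each string stands for one line of a large file.
lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread over many machines, which is what makes the model suitable for large data sets.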
We will see more about Hadoop in detail later.
This is not a blog post, but a JavaJEE note post (Read more @ http://javajee.com/javajee-note-posts).
Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson.
Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules J. Berman.
Wikipedia pages for all products listed here (if available).