According to the concept of the 3Vs, Big Data is data that may be very big (Volume), may arrive very fast, like continuous streaming data (Velocity), and may be very diverse, such as structured, unstructured and NoSQL database data (Variety). Data is the most important part of Big Data; if there is no data, there is no Big Data. So we will discuss how data is generated, the types of data, where data is stored, and the various challenges in managing and processing it.
The data in Big Data may be generated by humans or by machines.
Human Generated Data
Humans may generate data either knowingly or unknowingly.
Data generated intentionally by humans includes photos, videos and text, as well as shares and likes on social media.
Metadata is data about data, and it often accompanies the data contents without being noticed by the end user. Metadata is usually machine-readable as it usually follows some protocol.
Examples of metadata include:
Photograph Exif metadata, which contains additional info such as the location and time when the image was taken.
Cellphone call metadata, which contains the location and time details of the call.
Email metadata, which contains additional fields such as to, cc and from.
A tweet on Twitter, which contains a lot of metadata, often much bigger than the tweet content itself.
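As a small illustration of metadata accompanying content, email headers can be read programmatically. The sketch below uses Python's standard email module; the addresses and message are made up:

```python
from email import message_from_string

# A raw email message: the headers (From, To, Cc, Subject) are metadata,
# while the body text is the data itself.
raw = """From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Quarterly report

Please find the report attached.
"""

msg = message_from_string(raw)
for header in ("From", "To", "Cc", "Subject"):
    print(header, "->", msg[header])
```

Because the headers follow a protocol (RFC 5322), they can be parsed mechanically without any guessing, which is what makes metadata easy for machines to analyze.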
Machine Generated Data
A lot of data is generated by machines or devices as a by-product of other processing.
Machine-generated data usually follows some protocol, and hence it can be more easily read and analyzed than human-generated data.
Example sources for machine generated data are:
Data exchanged when cell phones connect to towers
Readings from medical devices
Web crawlers and spam bots.
There are many uses for this automatically generated data:
Monitoring production lines
Identifying and monitoring pets
Structured, Unstructured and Semi Structured Data
Data in Big Data may be structured or unstructured, whereas general relational data is mostly structured.
Data is said to be structured if it is placed in files with fixed fields or variables.
Examples include Relational Database Management Systems (RDBMS) and Excel spreadsheets, which arrange data into tabular form with fixed rows and columns.
Data is said to be completely unstructured if it is not mapped to any fields or variables.
Examples of unstructured data include text, presentations, images etc.
The majority of business data is unstructured.
Semi structured data is data that is neither fully structured, like data in an RDBMS, nor completely unstructured. Semi structured data may have fields and variables that can be marked, making the data identifiable; but these fields and variables are not fixed as in the case of an RDBMS.
Such data can still be arranged hierarchically and nested, without a fixed structure of fields and variables.
Examples of semi structured data include XML and JSON data that do not follow any predefined schema.
NoSQL databases follow a semi structured format, and usually use formats such as JSON to store data. Data is stored in key-value pairs, but different rows may contain different keys.
For instance, the first row for a person may contain name and salary, and the second row may contain name and city.
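The example above can be sketched in JSON, which Python can parse directly; the names and values are illustrative:

```python
import json

# Two documents for two people, each with a different set of keys --
# fine in a semi structured store, but not in a fixed RDBMS schema.
data = json.loads("""
[
  {"name": "Asha", "salary": 50000},
  {"name": "Ravi", "city": "Chennai"}
]
""")

for doc in data:
    print(sorted(doc.keys()))
```

Each document carries its own keys, so no table-wide schema needs to be declared in advance.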
MongoDB is a popular NoSQL database that uses a JSON-like document format (BSON) for saving data.
HBase is a NoSQL database popularly used with Hadoop.
NoSQL databases are not yet as widely used as RDBMSs, and they do not have a standard query language like SQL for RDBMSs. NoSQL databases and RDBMSs are actually suited for different scenarios.
If the data is not relational, then NoSQL is often the way to go.
Storing Big Data
Since the volume of data in Big Data is very large, data may need to be stored across many disks.
Traditional systems for storing distributed data include SAN and NAS.
A storage area network (SAN) is a dedicated network that provides access to consolidated, block-level data storage. A SAN may consist of a large and expensive collection of disk drives on racks.
Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network providing data access to a heterogeneous group of clients.
NAS provides both storage and a file system. This is often contrasted with SAN (Storage Area Network), which provides only block-based storage and leaves file system concerns on the "client" side.
Both SAN and NAS can be expensive solutions and may also require expertise to maintain.
HDFS, the Hadoop Distributed File System, can store data across many systems in a reliable and scalable way.
Hadoop is more affordable as it is designed to run on regular commodity hardware, and it also requires less expertise than managing a SAN or NAS.
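As a rough illustration of the idea behind HDFS (not its real API), the toy sketch below splits data into fixed-size blocks and replicates each block across several nodes; all names and sizes here are illustrative:

```python
BLOCK_SIZE = 4       # bytes per block (real HDFS defaults to 128 MB blocks)
REPLICATION = 3      # HDFS replicates each block 3 times by default
NODES = ["node1", "node2", "node3", "node4"]

def store(data: bytes):
    """Split data into blocks and assign each block to REPLICATION nodes."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Simple round-robin placement; real HDFS also considers rack awareness.
        targets = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = {"block": block, "nodes": targets}
    return placement

layout = store(b"hello big data!")
for idx, info in layout.items():
    print(idx, info["nodes"])
```

Because every block lives on multiple commodity machines, losing one machine does not lose data, which is how HDFS gets reliability from cheap hardware.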
Storing data in the cloud has many advantages, including:
Scalability (and even cost) – You can increase storage as needed, paying only for what you use.
Redundancy – Data may be stored redundantly to avoid data loss and to provide high availability.
Speed – Data may be stored on high-speed SSD drives.
Amazon S3 is an example.
Hadoop can be easily installed on these cloud servers, combining the benefits of cloud computing with those of HDFS.
Apart from the standard cloud service models such as IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service), we can also have DaaS (Data as a Service).
DaaS provides fast and reliable access to data.
You can buy access to data for research or other data analytics.
Challenges with Anonymity
Privacy is an important concern when processing data from various sources.
Known or unknown violations of privacy can lead to serious issues.
One possible solution is to anonymize the data set.
However, an anonymized dataset might still be de-anonymized, for example by cross-referencing it with other publicly available datasets.
Another practice followed to protect public records is to remove HIPAA-protected info, such as personal health information.
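One common anonymization step is pseudonymization: replacing direct identifiers with salted hashes so the same person maps to the same token without exposing the name. A minimal sketch, in which the field names and salt are illustrative:

```python
import hashlib

SALT = b"keep-this-secret"   # a real deployment would manage salts carefully

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a (truncated) salted SHA-256 hash."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

record = {"name": "John Doe", "zip": "12345", "diagnosis": "flu"}
record["name"] = pseudonymize(record["name"])
print(record)
```

Note that quasi-identifiers left in the record (such as the zip code above) can still enable de-anonymization when combined with other datasets, which is why hashing names alone is not a complete solution.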
Challenges with Confidentiality
While anonymity concerns are about identifying a person from public records, confidentiality deals with not making non-public info available to the public.
Protecting confidentiality helps in maintaining trust.
Some of the possible solutions or precautions are:
Protect data and control access to it
Purchase data breach insurance
Store only required data
Challenges with Data Quality
Input data can have errors or problems and bad input may in turn lead to bad results.
Possible errors or problems in input data are:
Incomplete or corrupted data records.
These may lead to null pointer errors, divide-by-zero errors etc.
Data that is missing context or measurement information, such as the unit of measurement.
Incomplete or incorrect transformations of data.
Though these problems apply to small data as well, they are more of a concern with Big Data and Hadoop because:
Normal methods for checking the accuracy of data may not be applicable to Big Data.
Data that stays in Hadoop often remains in its raw, unconverted form; and if data is not converted, it may never be examined.
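The input problems listed above can be guarded against with some defensive coding. A minimal sketch, where the function and the sample readings are illustrative:

```python
def safe_average(values):
    """Average a list of readings, skipping missing (None) entries."""
    clean = [v for v in values if v is not None]
    if not clean:                # guard: avoid divide-by-zero on all-missing data
        return None
    return sum(clean) / len(clean)

readings = [10.0, None, 14.0, None, 12.0]   # None marks incomplete records
print(safe_average(readings))        # 12.0
print(safe_average([None, None]))    # None, instead of a crash
```

At Big Data scale, checks like these usually have to run inside the processing pipeline itself, since manually inspecting the input is no longer feasible.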
Special Challenges with Big Data
Responsibilities are spread out.
Different people or groups prepare, analyze and apply the data.
There is a greater need to think about quality.
There is a greater need to think about the meaning of the project.
This is not a blog post, but a JavaJEE note post (Read more @ http://javajee.com/javajee-note-posts).
Reference: Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson.