No discussion of Big Data is complete without discussing Data Science and its relation to Big Data.
Data Science can be considered the extraction of knowledge from large volumes of data that are structured (e.g. RDBMS, Excel) or unstructured (e.g. emails, videos, photos, social media, and other user-generated content). It may be seen as a continuation of the fields of data mining and predictive analytics.
Data that scale to Big Data are of particular interest in data science, although the discipline is not generally considered to be restricted to such data. Data science employs techniques and theories drawn from many fields, such as nanotechnologies, physics, robotics, mathematics, statistics, information theory and information technology.
Data scientists are qualified people with the strength and patience to tunnel through large amounts of information, and the technical skills to write algorithms that extract insights from these mountains of information. They apply expertise in data preparation, statistics, and machine learning to investigate complex problems in many domains, such as marketing optimization, fraud detection, and setting public policy.
While some see no distinction between data science and statistics, others consider it a distinct field with specific skill sets, training techniques and goals. For the purpose of this note, we will assume that Data Science is more than just statistics.
Lynda.com’s course Techniques and Concepts of Big Data with Barton Poulson describes three facets of Data Science: coding, statistics and domain knowledge. It also discusses the Data Science Venn Diagram.
Statistics is the mathematical knowledge or training (e.g. probability); it helps in generating the right results.
Domain knowledge is knowledge about the domain in which the research is done (e.g. marketing), and it is very important for proper research. According to researchers such as Svetlana Sicular of Gartner, it is easier to teach domain people Hadoop than to make Hadoop people gain the domain knowledge.
Coding knowledge, even a little, can be handy in many areas, such as exploring and manipulating data sets and transforming data from various sources into common formats before processing. Knowledge of coding also helps with the algorithmic thinking needed to work through a problem.
Another version of the Venn diagram that I could find describes the three facets as:
Math and statistics knowledge (statistics)
Substantive expertise (domain knowledge)
Hacking skills (coding).
You can read more from the reference links.
Combination of different facets of Data Science
According to the Venn Diagram, each combination of skills has its own significance:
The combination of Statistics and Domain knowledge is often what traditional researchers possess.
Statistics and Coding together can result in machine-learning-based research and applications. An email spam filter is an example.
The combination of Domain knowledge and Coding, without Statistics, is considered a danger zone, as you are very unlikely to reach sound conclusions without statistics.
Finally, the combination of Statistics, Domain knowledge and Coding is what can be called Data Science.
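The combinations above can be written down as a simple lookup. This is only an illustrative sketch in Python: the function name venn_zone and the skill labels "statistics", "domain" and "coding" are my own assumptions for the example and do not come from the course or the diagram.

```python
# Illustrative only: the names below are hypothetical, chosen for this note.
FACETS = {"statistics", "domain", "coding"}

ZONES = {
    frozenset({"statistics", "domain"}): "traditional research",
    frozenset({"statistics", "coding"}): "machine learning",
    frozenset({"domain", "coding"}): "danger zone",
    frozenset({"statistics", "domain", "coding"}): "data science",
}


def venn_zone(skills):
    """Return the Venn diagram zone for a collection of skill names."""
    skills = frozenset(skills)
    unknown = skills - FACETS
    if unknown:
        raise ValueError(f"unknown skills: {sorted(unknown)}")
    # Zero or one skill does not fall into any overlap of the diagram.
    return ZONES.get(skills, "no overlap")


if __name__ == "__main__":
    print(venn_zone(["statistics", "coding"]))
    print(venn_zone(["domain", "coding"]))
```

Using frozenset as the key means the order in which the skills are listed does not matter.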
Almost everyone talks about "data science," "big data," and "analytics." However, there is a lack of clarity around the skill sets and capabilities of their practitioners. This lack of clarity has frequently led to missed opportunities.
To address this issue, the authors of the book “Analyzing the Analyzers” surveyed several hundred practitioners via the Web to explore the variety of skills, experiences, and viewpoints in the emerging data science community, and documented the results in the book. Here is a quick summary:
Data scientists were classified into four categories, with subtypes:
Data Researcher: Researcher, Scientist, Statistician
Data Creative: Jack of all trades, Artist, Hacker
Data Developer: Developer, Engineer
Data Businessperson: Leader, Businessperson, Entrepreneur
The book also classified the skill sets into 5 categories, with sub-skills:
Business: Product Development, Business
ML/Big Data: Unstructured Data, Structured Data, Machine Learning, Big and Distributed Data
Math/OR: Optimization, Math, Graphical Models, Bayesian/Monte Carlo Statistics, Algorithms, Simulation
Programming: System Administration, Back End Programming, Front End Programming
Statistics: Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics
Finally, the book reports which skill categories each data scientist category has, and in what percentage.
Every data scientist category had some knowledge from all skill set categories, but the percentage distribution of skill sets varied from one data scientist category to another. For instance, the Data Businessperson had a high percentage of Business skills, and the Data Developer had high percentages of ML/Big Data and Math/OR skills.
You can find the distribution as per the research in “Chapter 3: A Survey of, and About, Professionals”, under the heading “Combining Skills and Self-ID”.
According to Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson, the three facets of Data Science (Coding, Statistics and Domain Knowledge) apply even when the three Vs of Big Data (Volume, Velocity and Variety) are not all present at the same time:
Only Volume – when there is a lot of static data.
Only Velocity – when streaming data comes in and only a small window is analyzed at a time.
Only Variety – static but complex data, as in face recognition, data visualization, etc.
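To make the "Only Velocity" case more concrete, here is a minimal Python sketch of analyzing a stream one small window at a time. The function name windowed_means, the window size and the sample readings are assumptions for illustration; a real streaming system would read from a live source rather than a list.

```python
from collections import deque


def windowed_means(stream, window_size):
    """Yield the mean of the most recent `window_size` values of a stream."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # A deque with maxlen drops the oldest value automatically,
    # so only a small window of the stream is ever held in memory.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


if __name__ == "__main__":
    readings = [1, 2, 3, 4]  # hypothetical sample data
    for mean in windowed_means(readings, window_size=2):
        print(mean)
```

Because the function is a generator, it never needs the whole stream at once, which is the point of the velocity-only case.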
Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson describes the following valid cases:
Big Data with only Coding and Statistics
This is where Machine Learning fits
E.g. spam filter, facial recognition
Big Data with Coding and Domain Knowledge
E.g. Word Count, Natural Language Processing.
Big Data with only Statistics and Domain Knowledge (with no knowledge of coding at all) is not possible.
Big Data is also not possible with only one of Coding, Statistics and Domain Knowledge.
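The Word Count example mentioned above is a task that needs coding but little statistics. A minimal single-machine sketch in Python could look like the following; the function name word_count and the simple tokenizing rule are my own assumptions, and on real Big Data the same idea would typically run on a distributed framework rather than in one process.

```python
import re
from collections import Counter


def word_count(text):
    """Count how often each word occurs in `text`, ignoring case."""
    # Simplifying assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    sample = "Big data is big. Data science is more than big data!"
    for word, count in word_count(sample).most_common(3):
        print(word, count)
```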
This is not a blog post, but a JavaJEE note post (Read more @ http://javajee.com/javajee-note-posts).
Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson
Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work, by Harlan Harris, Sean Murphy, and Marck Vaisman.