I recently had a customer ask me to explain to his management team the difference between a Business Intelligence (BI) Analyst and a Data Scientist. I hear this question often, and I usually resort to showing Figure 1 (the BI Analyst vs. Data Scientist Characteristics chart, which outlines the different attitudinal characteristics of each role) …
Figure 1: BI Analyst vs. Data Scientist Characteristics
… and Figure 2 (Business Intelligence vs. Data Science, which outlines the different types of questions that each tries to address).
Figure 2: Business Intelligence vs. Data Science
But these slides lack the context needed to satisfactorily answer the question. I'm never sure the audience really comprehends the fundamental differences between what a BI analyst does and what a data scientist does. The key is to understand the differences between the BI analyst's and the data scientist's goals, tools, methods and techniques. Here's the explanation.
The Business Intelligence (BI) Analyst Engagement Process
Figure 3 outlines the top-level analytic process that a typical BI Analyst uses when engaging with the business users.
Figure 3: Business Intelligence Engagement Process
Step 1: Build the Data Model. The process starts by building the underlying data model. Whether you use a data warehouse, data mart or hub-and-spoke approach, and whether you use a star schema, snowflake schema or third normal form schema, the BI Analyst needs to go through a formal requirements gathering process with the business users to identify all (or at least the large majority of) the questions that the business users want to address. In this requirements gathering process, the BI analyst should identify the first-level and second-level questions the business users want addressed in order to build a robust and scalable data warehouse. For example:
- 1st level question: How many patients did we treat last month?
- 2nd level question: How did that compare to the previous month?
- 2nd level question: What were the major DRG types treated?
- 1st level question: How many patients came through the ER last night?
- 2nd level question: How did that compare to the previous night?
- 2nd level question: What were the top admission reasons?
- 1st level question: What percentage of beds was used at Hospital X last week?
- 2nd level question: What is the trend of bed utilization over the past year?
- 2nd level question: Which departments had the largest increase in bed utilization?
The BI Analyst then works closely with the data warehouse team to define and build the underlying data models that support the questions being asked.
Note: the data warehouse is a "schema-on-load" approach because the data schema must be defined and built prior to loading data into the data warehouse. Without an underlying data model, the BI tools will not work.
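The hospital example above can be made concrete with a tiny schema-on-load sketch: the star schema (fact and dimension tables) is created up front, and only then is data loaded and queried. All table names, column names and figures below are invented for illustration.

```python
import sqlite3

# Hypothetical star schema: a fact table of admissions keyed to date and
# hospital dimension tables. The schema is defined BEFORE any data arrives.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_hospital (hospital_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_admissions (
        date_key INTEGER REFERENCES dim_date(date_key),
        hospital_key INTEGER REFERENCES dim_hospital(hospital_key),
        patients INTEGER
    );
""")
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(1, "2023-01"), (2, "2023-02")])
cur.executemany("INSERT INTO dim_hospital VALUES (?, ?)", [(1, "Hospital X")])
cur.executemany("INSERT INTO fact_admissions VALUES (?, ?, ?)",
                [(1, 1, 120), (2, 1, 135)])

# 1st level question: how many patients did we treat each month?
rows = cur.execute("""
    SELECT d.month, SUM(f.patients)
    FROM fact_admissions f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.month ORDER BY d.month
""").fetchall()
print(rows)  # [('2023-01', 120), ('2023-02', 135)]
```

Any question whose dimensions and measures are already in the schema can be answered with a query like this; a question that needs a new dimension forces a schema change first.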
Step 2: Define the Report. Once the analytic requirements have been transcribed into a data model, step 2 of the process is where the BI Analyst uses a Business Intelligence (BI) product (SAP Business Objects, MicroStrategy, Cognos, QlikView, Pentaho, etc.) to create the SQL-based query for the desired questions (see Figure 4).
Figure 4: Business Intelligence (BI) Tools
The BI Analyst will use the BI tool's graphical user interface (GUI) to produce the SQL query by picking the dimensions and measures; choosing column, row and page descriptors; specifying subtotals, totals and constraints; creating special calculations (mean, moving average, rank, share of); and choosing sort criteria. The BI GUI hides much of the complexity of creating the SQL.
Step 3: Generate SQL Commands. Once the BI Analyst or the business user has specified the desired report or query request, the BI tool generates the SQL commands. In some cases, the BI Analyst will modify the SQL commands generated by the BI tool to include special SQL commands that may not be supported by the BI tool.
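The GUI-to-SQL translation can be sketched in a few lines: selected dimensions become GROUP BY columns, selected measures become aggregates, and constraints become a WHERE clause. The function name, parameters and table below are invented for illustration and do not correspond to any particular BI tool's internals.

```python
# Minimal sketch of how a BI tool might assemble SQL from GUI selections.
def build_query(table, dimensions, measures, constraints=None):
    # measures are (aggregate, column) pairs, e.g. ("SUM", "patients")
    select_cols = dimensions + [f"{agg}({col}) AS {col}_{agg.lower()}"
                                for agg, col in measures]
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"
    if constraints:
        sql += " WHERE " + " AND ".join(constraints)
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql

sql = build_query("fact_admissions",
                  dimensions=["month"],
                  measures=[("SUM", "patients")],
                  constraints=["hospital = 'Hospital X'"])
print(sql)
# SELECT month, SUM(patients) AS patients_sum FROM fact_admissions
#   WHERE hospital = 'Hospital X' GROUP BY month
```

Real BI products generate far more elaborate SQL (joins across the star schema, subtotal rollups, window functions), but the principle is the same: the user picks names from the pre-built model, and the tool writes the query.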
Step 4: Create Report. In step 4, the BI tool issues the SQL commands against the data warehouse and creates the corresponding report or dashboard widget. This is a highly iterative process, where the BI Analyst will tweak the SQL (either using the GUI or hand-coding the SQL statement) to fine-tune the request. The BI Analyst can also define graphical rendering options (bar charts, line charts, pie charts) until they get the precise report and/or graphic that they want (see Figure 5).
Figure 5: Typical BI Tool Graphic Options
By the way, this is a fine example of the power of schema-on-load. This conventional schema-on-load approach shields much of the underlying data complexity from the business users, who can then use the GUI-based BI tools to more easily engage with and explore the data (think self-service BI).
In summary, the BI approach leans heavily on the pre-built data warehouse (schema-on-load), which enables users to quickly and easily ask further questions, as long as the data they need is already in the data warehouse. If the data is not in the data warehouse, then adding data to an existing warehouse (and creating all the supporting ETL processes) can take months.
The Data Scientist Engagement Process
Figure 6 lays out the Data Scientist engagement process.
Figure 6: Data Scientist Engagement Process
Step 1: Define Hypothesis to Test. Step 1 of the data science process begins with the Data Scientist identifying the prediction or hypothesis that they wish to test. Again, this is the result of collaborating with the business users to understand the key sources of business differentiation (e.g., how the organization delivers value) and then brainstorming data and variables that might yield better predictors of performance. This is where a Vision Workshop process can add considerable value in driving the collaboration between the business users and the data scientists to identify data sources that may help improve predictive value (see Figure 7).
Figure 7: Vision Workshop Data Assessment Matrix
Step 2: Gather Data. Step 2 of the data science process is where the data scientist gathers relevant and/or interesting data from a multitude of sources, ideally both internal and external to the organization. The data lake is a terrific approach for this process, as the data scientist can grab any data they want, test it, ascertain its value given the hypothesis or prediction, and then decide whether to include that data in the predictive model or throw it away. #FailFast #FailQuietly
Step 3: Build Data Model. Step 3 is where the data scientist defines and builds the schema necessary to address the hypothesis being tested. The data scientist can't define the schema until they know the hypothesis that they are testing AND know what data sources they are going to be using to build their analytic models.
Note: this "schema-on-query" process is significantly different from the traditional data warehouse "schema-on-load" process. The data scientist does not spend months integrating all the different data sources into a formal data model first. Instead, the data scientist defines the schema as needed, based upon the data being used in the analysis. The data scientist will likely iterate through several different versions of the schema until finding a schema (and analytic model) that adequately addresses the hypothesis being tested.
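Schema-on-query can be sketched in a few lines: raw records land in the lake as-is (here, JSON strings with inconsistent fields and types), and a schema is imposed only when a specific hypothesis needs it. The field names and records below are invented for illustration.

```python
import json

# Raw, untyped records as they might sit in a data lake.
raw = [
    '{"patient_id": 1, "age": 64, "readmitted": true}',
    '{"patient_id": 2, "age": 51}',
    '{"patient_id": 3, "age": "70", "readmitted": false}',
]

def project(record, schema):
    """Coerce one raw record into the schema needed for THIS analysis.
    schema maps field name -> (cast function, default for missing values)."""
    row = json.loads(record)
    return {field: cast(row.get(field, default))
            for field, (cast, default) in schema.items()}

# Schema defined for one hypothesis (does age predict readmission?),
# not up front for every possible question.
schema = {"age": (int, 0), "readmitted": (bool, False)}
rows = [project(r, schema) for r in raw]
print(rows)
# [{'age': 64, 'readmitted': True}, {'age': 51, 'readmitted': False},
#  {'age': 70, 'readmitted': False}]
```

If the hypothesis changes, only the `schema` dictionary changes; nothing has to be re-modeled or re-loaded, which is exactly the overhead the schema-on-load approach incurs.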
Step 4: Explore the Data. Step 4 of the data science process leverages outstanding data visualization tools to uncover correlations and outliers of interest in the data. Data visualization tools like Tableau, Spotfire, Domo and DataRPM are excellent data scientist tools for exploring the data and identifying variables that they may want to test (see Figure 8).
Figure 8: Sample Data Visualization Tools
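As a programmatic stand-in for what these visualization tools surface, a simple correlation scan can flag which candidate variables merit a closer look. All of the sample data below is invented.

```python
import math

bed_util        = [0.72, 0.81, 0.65, 0.90, 0.78]   # outcome of interest
er_visits       = [140, 165, 120, 190, 150]        # candidate predictor
cafeteria_sales = [900, 880, 910, 905, 895]        # candidate predictor

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(bed_util, er_visits), 3))        # strongly positive
print(round(pearson(bed_util, cafeteria_sales), 3))  # weak: likely discard
```

The same triage the eye performs on a scatter plot, done numerically: keep the variables that move with the outcome, throw the rest away. #FailFast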
Step 5: Build and Refine Analytic Models. Step 5 is where the real data science work begins, as the data scientist starts using tools like SAS, SAS Miner, R, Mahout, MADlib and Alpine Miner to build analytic models. This is true science, baby!! At this point, the data scientist will explore different analytic techniques and algorithms to try to create the most predictive models. As my data scientist friend Wei Lin shared with me, this includes some of the following algorithmic techniques:
Markov chains, genetic algorithms, geo-fencing, custom modeling, propensity analysis, neural networks, Bayesian inference, principal component analysis, singular value decomposition, optimization, linear programming, non-linear programming and more.
All in the name of trying to quantify cause-and-effect! I do not recommend trying to win a game of chess against one of these folks.
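Before reaching for any of those techniques, a data scientist will usually fit a simple baseline first. Here is a minimal ordinary-least-squares fit in plain Python; the data is made up for illustration.

```python
# Ordinary least squares for a single predictor: slope = cov(x, y) / var(x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g., prior-month ER visits (scaled)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # e.g., bed utilization (scaled)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

print(round(slope, 3), round(intercept, 3))
```

If a two-line baseline like this already predicts well, the fancier algorithms have a high bar to clear; if it doesn't, its residuals tell the data scientist where to look next.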
Step 6: Ascertain Goodness of Fit. Step 6 in the data science process is where the data scientist tries to ascertain the model's goodness of fit. The goodness of fit of a statistical model describes how well the model fits a set of observations. Several different analytic techniques can be used to determine the goodness of fit, including the Kolmogorov-Smirnov test, Pearson's chi-squared test, analysis of variance (ANOVA) and the confusion (or error) matrix.
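Of those, the confusion (or error) matrix is the easiest to sketch: cross-tabulate actual versus predicted labels and read off how often the model was right. The labels below are invented for illustration.

```python
# Binary classification results: 1 = event occurred, 0 = it did not.
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Tally the four cells of the confusion matrix.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print([[tp, fn], [fp, tn]])     # confusion (error) matrix
print((tp + tn) / len(actual))  # accuracy
```

In practice the data scientist looks past raw accuracy to the off-diagonal cells, since the business cost of a false positive rarely equals the cost of a false negative.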
My point isn’t that Business Intelligence and schema-on-load is bad, and data science and schema-on-query is great. It’s that they deal with different types of questions. They are various techniques, meant for different environments, and utilized at different phases in the analysis process. In the BI process, the schema needs to be constructed initially and must be developed to support a variety of questions across a wide variety of business functions. So the information model should be extensible and scalable which implies that it is greatly engineered. Think production quality. In the information science procedure, the schema is built to only support the hypothesis being checked so the information design can be done quicker and with less overhead. Think ad hoc quality.
The data science process is highly collaborative; the more subject matter experts involved in the process, the better the resulting model. And perhaps even more importantly, the involvement of the business users throughout the process ensures that the data scientists focus on uncovering analytic insights that pass the S.A.M. test: Strategic (to the business), Actionable (insights that the organization can actually act upon), and Material (where the value of acting on the insights is greater than the cost of acting on them).
 Disclaimer: I serve on DataRPM’s Advisory Board