
An Overview of SAS Enterprise Miner

The following article is about Enterprise Miner v4.3, which is available in SAS v9.1.3. Enterprise Miner is a powerful product that SAS first introduced in version 8. It consists of a variety of analytical tools to support data mining analysis. Data mining is an analytical process used to solve critical business decisions by analyzing large amounts of data in order to discover relationships and unknown patterns in the data. The Enterprise Miner SEMMA data mining methodology is specifically designed to handle enormous data sets in preparation for subsequent data analysis. In SAS Enterprise Miner, the SEMMA acronym stands for Sampling, Exploring, Modifying, Modeling, and Assessing large amounts of data.

The reason that SAS Enterprise Miner has been given this acronym is that the first step in data mining is usually to sample the data in order to acquire a representative subset. The next step is usually to explore the distribution or the range of values of each variable in the selected data set. This might be followed by modifying the data set by replacing missing values or transforming the data in order to achieve normality, since many of the analytical tools depend on the variables having a normal distribution. This is because many of the nodes in Enterprise Miner calculate the squared distances between the variables selected for the analysis. The next step might be to model the data; in other words, there might be interest in predicting certain variables in the data. The final step might be to determine which models are best by assessing the accuracy of the different models that have been created.
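Because many of the nodes work with squared distances between observations, variables on very different scales can swamp the calculation, which is why transforming or standardizing the data matters. The following is a minimal illustrative sketch in stdlib Python (not SAS code; the variable names and data are hypothetical) showing how z-score standardization puts two variables on a common footing:

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Two input variables on wildly different scales (made-up data).
income = [25000, 48000, 61000, 90000]   # dollars
age    = [23, 35, 41, 58]               # years

# Squared distance between the first two observations, raw vs. standardized.
raw_d2 = (income[0] - income[1]) ** 2 + (age[0] - age[1]) ** 2
z_income, z_age = zscore(income), zscore(age)
std_d2 = (z_income[0] - z_income[1]) ** 2 + (z_age[0] - z_age[1]) ** 2

print(raw_d2)  # dominated almost entirely by the income scale
print(std_d2)  # both variables now contribute comparably
```

On the raw scale the income difference alone accounts for nearly all of the squared distance; after standardizing, age is no longer drowned out.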

The Ease of Use of Enterprise Miner

SAS Enterprise Miner is a powerful module introduced in version 8. More importantly, it is a very easy application to learn and use. SAS Enterprise Miner is visual programming with a GUI interface: you select icons, or nodes, from the EM tool palette or menu bar, drag them onto the EM diagram workspace, and connect the nodes to one another in a graphical process flow diagram. You do not need to know SAS programming, and you need very little statistical expertise, to develop an Enterprise Miner project. Yet an expert statistician can adjust and fine-tune the default settings and run the SEMMA process flow diagram to their own personal specifications. The diagram workspace environment looks similar to the desktop in Windows 95, 98, XP, and Vista. Enterprise Miner can save a tremendous amount of time otherwise spent programming in SAS. However, it also has a powerful SAS Code node that brings the capability of SAS programming into the SEMMA data mining process, giving access to the SAS data step and a wide range of the powerful SAS procedures from within the process flow diagram. Enterprise Miner produces a wide variety of statistics, from descriptive, univariate, and goodness-of-fit statistics to numerous types of charts and plots, traditional regression modeling, decision tree analysis, principal component analysis, cluster analysis, association analysis, and link analysis, along with automatically generated graphs that can be directed to the SAS output window.

The Purpose of the Enterprise Miner Nodes

Data mining is a sequential process of Sampling, Exploring, Modifying, Modeling, and Assessing large amounts of data to discover trends, relationships, and unknown patterns in the data. SAS Enterprise Miner is designed for SEMMA data mining. SEMMA stands for the following:

Sample: Identify an analysis data set that is large enough to yield significant findings, yet small enough to process in a reasonable amount of time. The nodes create the analysis data set, randomly sample the source data set, or partition the source data set into training, validation, and test data sets.

Explore: Explore the data sets to observe unexpected trends, relationships, patterns, or unusual observations while becoming familiar with the data. The nodes plot the data, generate a wide variety of analyses, identify important variables, or perform association analysis.

Modify: Prepare the data for analysis. The nodes can create additional variables or transform existing variables, modifying the way in which the variables are used in the analysis; filter the data; replace missing values; condense and collapse the data in preparation for time series modeling; or perform cluster analysis.

Model: Fit the statistical model. The nodes predict the target variable from the input variables using least-squares or logistic regression, decision tree, neural network, dmneural network, user-defined, ensemble, nearest neighbor, or two-stage modeling.

Assess: Compare the accuracy of the statistical models. The nodes compare the performance of the various classification models by viewing the competing probability estimates in the lift charts, ROC charts, and threshold charts. For predictive modeling designs, the performance of each model and the modeling assumptions can be verified from the prediction plots and diagnostic charts.
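The lift charts used in the Assess step compare a model's captured response rate against the baseline response rate. The underlying calculation can be sketched in stdlib Python (an illustrative simplification, not Enterprise Miner's own code; the scores and outcomes are made up):

```python
def lift_at(scores, outcomes, fraction):
    """Lift = response rate among the top `fraction` of scored cases,
    divided by the overall response rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda t: -t[0])
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate

# Hypothetical predicted probabilities and actual 0/1 responses.
scores   = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcomes = [1,   1,   0,   1,   0,   0,   1,   0]

print(lift_at(scores, outcomes, 0.25))  # → 2.0: top quartile responds twice the baseline rate
```

A lift of 2.0 in the top quartile means the model's best-scored cases respond at twice the overall rate, which is exactly the comparison a lift chart plots across all depths.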

Note: Although the Utility nodes are not part of the SEMMA acronym, they allow you to perform group processing, create a data mining data set for viewing various descriptive statistics of the entire data set, and organize the process flow more efficiently by reducing the number of connections or condensing the process flow into smaller, more manageable subdiagrams.
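The "replace missing values" task mentioned in the Modify step can be illustrated minimally with mean imputation (a stdlib-Python sketch of the general idea, not the algorithm any particular Enterprise Miner node uses):

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([4.0, None, 6.0, None, 8.0]))  # → [4.0, 6.0, 6.0, 6.0, 8.0]
```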

The purpose of the Input Data Source node is to read in a SAS data set, or to import and export other types of data through the SAS Import Wizard. The Input Data Source node reads the data source and creates a data set called a metadata sample that automatically defines the variable attributes for later processing within the process flow. In the metadata sample, each variable is automatically assigned a level of measurement and a variable role for the analysis. For example, categorical variables with more than two and fewer than ten class levels are automatically assigned a nominal measurement level with an input variable role. By default, the metadata sample takes a random sample of 2,000 observations from the source data set; if the data set is smaller than 2,000 observations, then the entire data set is used to create the data mining data set. From the metadata sample, the node displays various summary statistics for both interval-valued and categorical-valued variables. For the interval-valued variables, it is important that the variables share the same range of values; otherwise, transformations such as standardizing are recommended. This is because a large majority of data mining analysis designs apply the squared distance between the data points. The node has the option of editing the target profile for categorical-valued target variables in order to assign prior probabilities to the categorical response levels that truly represent the appropriate proportion of responses, in addition to predetermined profit and cost amounts for each target-specified decision consequence, in order to maximize expected profit or minimize expected loss in the subsequent statistical models.
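The metadata sample's behavior described above, drawing up to 2,000 observations and assigning a level of measurement from the number of distinct class levels, can be sketched as follows. This is a hypothetical stdlib-Python simplification of the node's heuristic, not its actual rules:

```python
import random

def metadata_sample(rows, seed=12345, size=2000):
    """Use the whole data set if it has no more than `size` rows;
    otherwise draw a random sample of `size` rows."""
    if len(rows) <= size:
        return list(rows)
    return random.Random(seed).sample(rows, size)

def assign_level(values):
    """Assign a level of measurement from the count of distinct levels
    (a deliberate simplification of the node's heuristic)."""
    levels = len(set(values))
    if levels == 2:
        return "binary"
    if 2 < levels < 10:
        return "nominal"   # more than two, fewer than ten class levels
    return "interval"      # many distinct (numeric) values

sample = metadata_sample(list(range(5000)))
print(len(sample))                          # → 2000
print(assign_level(["a", "b", "c", "a"]))   # → nominal
```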

The purpose of the Sampling node is to perform various sampling techniques on the input data set. Sampling is recommended for extremely large data sets to reduce both the memory resources and the processing time of data mining. The node performs random, systematic, stratified, sequential, and cluster sampling. From the node, you have the option to specify the desired sample size by entering either the appropriate number of records or the percentage of allocation. The node also enables you to define the method for stratified sampling, in which observations are randomly selected within each of the non-overlapping groups, or strata, that are created. The stratified sample can be drawn with the same proportion of observations within each stratum, with an equal number of observations in each stratum, with the stratified groups formed by the proportion of observations and the standard deviation of a specified variable within each group, or as a user-defined stratified sample in which each stratified group is created by the class levels of a categorical-valued variable. For cluster sampling, the node is designed for you to specify a random sample of clusters, where the clusters are usually of unequal sizes, and then specify the number of clusters to sample from all the selected clusters as either every nth cluster or the first n clusters. You may also specify a certain percentage of clusters based on all of the clusters that are created. The random seed number determines the sampling; therefore, using an identical random seed number to select the sample from the same SAS data set will create an identical random sample of the data set. The exception is when the random seed is set to zero, in which case the random seed number is taken from the computer's clock at run time. An output data set is created from the selected sample and passed on through the process flow diagram.
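Proportional stratified sampling with a fixed random seed, as described above, can be sketched in stdlib Python (an illustrative sketch, not the Sampling node's implementation; the data and names are hypothetical). Note how the same seed reproduces the same sample:

```python
import random
from collections import defaultdict

def stratified_sample(rows, stratum_of, fraction, seed=12345):
    """Draw the same fraction of rows, at random, from each stratum."""
    rng = random.Random(seed)   # a seed of 0 would instead come from the clock
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for group in strata.values():
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: (customer_id, region), 75 "east" rows and 25 "west" rows.
rows = [(i, "east" if i % 4 else "west") for i in range(100)]
s1 = stratified_sample(rows, lambda r: r[1], 0.2)
s2 = stratified_sample(rows, lambda r: r[1], 0.2)
print(len(s1))   # → 20 (15 from "east", 5 from "west")
print(s1 == s2)  # → True: identical seed, identical sample
```

Because each stratum contributes the same 20% fraction, the sample preserves the 3-to-1 east/west proportion of the source data.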

The purpose of the Data Partition node is to partition, or split, the metadata sample into training, validation, and test data sets. The purpose of splitting the original source data set into separate data sets is to prevent overfitting and achieve good generalization of the statistical modeling design. Overfitting occurs when the model generates an exceptional fit to the data, yet fitting the same model to an entirely different random sample of the same data set results in a poor fit. Generalization is analogous to interpolation or extrapolation, generating unbiased and accurate estimates by fitting the model to data entirely different from the data that was used in fitting the statistical model. The node allows you to select either a simple random sample, a stratified random sample, or a user-defined sample to create the partitioned data sets. The random seed number determines the random sampling, which follows a uniform distribution between zero and one, along with a counter number that is created for each data set in order to regulate the correct number of records allocated to the partitioned data sets. The node also allows you to perform user-defined sampling, where the class levels of a categorical-valued variable determine the partitioning of the data. User-defined sampling is advantageous in time series modeling, where the data must be retained in chronological order over time.
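The seeded uniform-draw-plus-counter scheme described above might look like the following in stdlib Python. This is a hedged sketch of the general technique, not the node's actual code; the 40/30/30 allocation is just an example:

```python
import random

def partition(rows, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split rows into training/validation/test sets using seeded uniform
    draws, with counters enforcing the exact allocation."""
    rng = random.Random(seed)
    targets = [round(len(rows) * f) for f in fractions]
    targets[-1] = len(rows) - sum(targets[:-1])   # make counts add up exactly
    splits, counts = ([], [], []), [0, 0, 0]
    for row in rows:
        u = rng.random()                          # uniform draw on (0, 1)
        if u < fractions[0]:
            preference = [0, 1, 2]
        elif u < fractions[0] + fractions[1]:
            preference = [1, 0, 2]
        else:
            preference = [2, 0, 1]
        for i in preference:        # the counters redirect rows from full splits
            if counts[i] < targets[i]:
                splits[i].append(row)
                counts[i] += 1
                break
    return splits

train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))  # → 40 30 30
```

The counters are what the paragraph above alludes to: without them, random draws alone would only approximate the requested split sizes.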

The purpose of the Distribution Explorer node is to view multidimensional histograms in order to graphically examine the variables in the analysis. Observing the distribution or the range of values of each variable is usually the first step in data mining. Although the node is designed to view the distribution of each variable separately, it has the added capability of viewing the distribution of up to three separate variables at the same time. In other words, the node displays up to a 3-D frequency bar chart based on either the frequency percentage, mean, or sum, and allows you to select the axis variables for the multidimensional bar chart. The node also allows you to display a frequency bar chart of each variable. For categorical-valued variables, you have the option of specifying the number of bins, or bars, that will be displayed in the multidimensional bar chart. For interval-valued variables, the node allows you to set the range of values that will be displayed within the bar chart. Descriptive statistics are generated for the interval-valued variables by each categorical-valued variable; otherwise, if the selected axis variables are all categorical, frequency tables are generated.
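The frequency tables behind the multidimensional bar chart amount to cross-tabulating up to three categorical variables. A stdlib-Python sketch with made-up records (the column names are hypothetical):

```python
from collections import Counter

def crosstab(rows, *columns):
    """Count the joint frequency of up to three categorical variables."""
    return Counter(tuple(row[c] for c in columns) for row in rows)

# Hypothetical records with two categorical variables.
rows = [
    {"gender": "F", "owns_car": "yes"},
    {"gender": "F", "owns_car": "no"},
    {"gender": "M", "owns_car": "yes"},
    {"gender": "F", "owns_car": "yes"},
]
table = crosstab(rows, "gender", "owns_car")
print(table[("F", "yes")])  # → 2: the height of one bar in the frequency chart
```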

The purpose of the Multiplot node is to graphically view the numerous variables in the analysis through a built-in slide show. The node creates various bar charts, stacked bar charts, and scatter plots. The graphs are designed to reveal trends, patterns, and extreme values among the variables in the active training data set. The Multiplot node gives you the added flexibility to add or remove various charts and graphs from the slide show, which allows you to browse the distributions of the variables by scrolling back and forth through the multitude of charts and graphs that are automatically created.

The purpose of the Insight node is to browse the corresponding data set and perform a wide assortment of analyses. The node opens a SAS/INSIGHT session. Initially, when the node is opened, it displays a table listing that is similar to the SAS Table View environment. The node can generate various charts and graphs such as histograms, box plots, probability plots, line plots, scatter plots, contour plots, and rotating plots. It generates numerous univariate statistics, trimmed mean statistics, and robust measures of scale. In addition, the node performs a wide range of analyses, from regression modeling and logistic regression modeling to multivariate analysis and principal component analysis. The node is also capable of transforming and creating new variables in the data set. However, SAS advises against loading extremely large data sets into SAS/INSIGHT; therefore, the node has the option of taking a random sample of the entire data set or a subset random sample of the data.

The purpose of the Association node is to identify associations or relationships between certain items or events that occur together or in a particular sequence. The node is designed to perform either association or sequence discovery analysis. This type of analysis is often called market basket analysis. An example of market basket analysis is, "if a customer buys product A, what is the probability that the customer will also purchase product B?" In other words, the node is designed to determine the association or relationship between one item or event and another item or event in the data. Measuring the strength of the associations or interactions between the items of the if-then rule
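The strength of an if-then association rule is conventionally measured by support and confidence. A minimal stdlib-Python sketch of those two measures, with made-up market baskets (this illustrates the standard definitions, not the Association node's internals):

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support of the combined
    item set divided by the support of the left-hand side."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical market baskets.
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}]
print(support(baskets, {"A", "B"}))       # → 0.5: A and B appear together in half the baskets
print(confidence(baskets, {"A"}, {"B"}))  # 2 of the 3 baskets with A also contain B
```

In the customer example above, confidence of the rule "A implies B" is the estimated probability that a customer who buys product A also buys product B.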