Kattamuri S. Sarma, PhD. Predictive Modeling with SAS® Enterprise Miner™: Practical Solutions for Business Applications, Second Edition.
Sources of Modeling Data. Data Cleaning Before Launching SAS Enterprise Miner. Chapter 2: Getting Started with Predictive Modeling.
This book is intended for graduate students, researchers, and statisticians interested in predictive modeling, and for data mining experts who want to learn SAS Enterprise Miner and carry out analytics projects in it. Predictive modeling is the use of known, historical data and mathematical techniques to predict future outcomes.
Applications of predictive modeling are unlimited and span many areas; logistic regression is among the most widely used methods.
Predictive Modeling with SAS® Enterprise Miner
SAS Institute Inc. offers a course on predictive modeling using logistic regression. It covers predictive models, data mining, supervised learning, and propensity modeling, including the use of separate training, validation, and test data for better modeling and accuracy (Christie et al.). Therefore, statisticians must consider the state of the data and the applicability of the statistical models.
Working with complex data provides tremendous opportunities for statisticians who master the techniques of data mining. Another major difference concerns the size of the data sets. In statistics, with the primary concern being inference, p-values and statistical significance are the primary measures of model effectiveness.
However, data mining typically involves large data sets that are observational rather than randomly sampled. In such large samples, confidence interval widths shrink toward 0 as the sample size increases, so even negligible effects become statistically significant.
It is not unusual to have a regression model with every parameter statistically significant but with a correlation coefficient of almost 0. Therefore, other measures of model effectiveness are used.
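This phenomenon is easy to demonstrate with a small simulation. The sketch below is illustrative, not from the book; the sample size, slope, and seed are arbitrary choices. With a very weak true relationship and a large sample, the correlation test is decisively significant even though the correlation coefficient is almost 0.

```python
import math
import random

random.seed(42)

n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
# The true slope (0.02) is negligible in practical terms.
y = [0.02 * xi + random.gauss(0, 1) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

# t statistic for H0: correlation = 0
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(f"r^2 = {r ** 2:.6f}, t = {t:.2f}")
```

Here r squared is on the order of 0.0004, yet the t statistic comfortably exceeds the usual 1.96 cutoff, which is why data miners rely on other measures of model effectiveness.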
In data mining, the data sets are usually large enough to partition into three parts: training, validation, and testing. The training data set is used to define the model, the validation data set is used in an iterative process to change the model if necessary to improve it, and the testing data set represents a final examination of the model.
Misclassification is used in supervised learning, where there is a specific outcome variable; depending upon the profit and loss requirements in the data, misclassifications can be weighted accordingly. Although confounding can be dealt with using statistical linear models by including potential confounders in the model, in practice, the data sets collected are often too small to include all confounders that should be considered.
Because data mining tools can include hundreds and sometimes thousands of variables simultaneously, potential confounders can and should always be considered in the data mining process. The models are based on analysis of vast amounts of data from across an enterprise. Interactive statistical and visualization tools help you better search for trends and anomalies and help you focus on the model development process. Display 1. You can access an existing project or create a new project.
If you are using the software for the first time, select New Project. To access an existing project, select Open Project. Specify a name for the project in the Path field. Start-up code allows you to enter SAS code that runs as soon as the project is open; Exit Code runs every time the project is closed. The purpose of creating a separate library name for each project is to organize the data files. Once you have created a new project, the Enterprise Miner—Workloads window appears see Display 1.
The menus on the left change based on the choices you make. Just click the … button next to Start-Up Code in the menu on the left.
All analyses begin with a data set. SAS Enterprise Miner software is used primarily to investigate large, complex data sets with tens of thousands to millions of records and hundreds to thousands of variables.
Data mining finds patterns and relationships in the data and determines whether the discovered patterns are valid. First, right-click Data Sources. Click Next to accept the default Display 1.
Select the library name where the data set is located and then click OK. Select the file, click OK, and then click Next. In Display 1. Click Next. You should routinely choose Advanced, even though the default is Basic. A rejected variable is not used in any data mining analysis. An input also called an independent variable is used in the various models and exploratory tools in the software. Often, input variables are used to predict a target value also called a dependent value. An identification variable is used to label a particular observation in the data set.
While they are not used to predict outcomes, identification variables are used to link different observations to the same ID.
Some data mining techniques require a target variable while others need only input variables. There are other categories of variables as well that are used for more specialized analyses. In particular, a time ID variable is used in place of the more standard ID variable when the data are tracked longitudinally.
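The distinction among these roles can be pictured with a small sketch. This is hypothetical Python, not SAS Enterprise Miner metadata; the variable names and assignments below are invented for illustration.

```python
# Hypothetical sketch of the metadata kept for each variable:
# a role (input, target, id, rejected) and a measurement level.
variables = {
    "Faculty":             {"role": "id",       "level": "nominal"},
    "Year":                {"role": "input",    "level": "nominal"},
    "Rank":                {"role": "input",    "level": "nominal"},
    "Percent instruction": {"role": "input",    "level": "interval"},
    "Workload outcome":    {"role": "target",   "level": "interval"},
    "Comments":            {"role": "rejected", "level": "nominal"},
}

# Only input variables feed the models; rejected variables are ignored,
# and the id variable merely labels and links observations.
inputs = [name for name, meta in variables.items() if meta["role"] == "input"]
print(inputs)
```

The point of the table is that changing an analysis often means changing roles, not changing the data itself.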
A raw data set is used to perform the initial analyses, unless the data are longitudinal, in which case the data set is identified as transactional.
However, the procedures have different names, and sometimes perform slightly different functions. For example, the regression node in SAS Enterprise Miner performs linear or logistic regression, depending upon whether the target variable is continuous or discrete. The procedure itself determines the method to use.
Similarly, there is a node for clustering. However, not all the variables are listed. Use the scroll bar to access the remaining variables. You can change these roles, as shown in Display 1. For variables highlighted in the Role column, options are displayed. You can click on the role to be assigned to the variable. In this example data set, only some of the variables will be examined; others will be rejected because they are not needed.
However, this book focuses on the most basic roles and introduces others as needed. Similarly, the level can be changed. Most of the levels in this data set are either Nominal or Interval, but not all of the data fit their default levels; where a default level is wrong, it can be changed (for example, to Interval).
They are explained in later chapters. For the basic analysis, the assigned role should be the default value of Raw. Other possible data set roles include Transaction, which is used if there is a time component to the data. Score is used when fresh data are introduced into a model. The remaining values will not be used in this example. It is sometimes necessary to change either a variable role or a level in the data set by selecting Edit Variables, which is found by right-clicking the data set name in the drop-down list Display 1.
Therefore, you must explore the data, draw some graphs of the data, and perform basic data summaries, including some tables. There are three ways to begin exploring. The first is to click Explore shown in the upper right-hand corner of Display 1.
A second way is to select Explore from the menu shown in Display 1. A third way is to click the StatExplore icon located on the Explore tab. Each of these options leads to the same exploration. However, in order to use the StatExplore icon, a diagram should be defined.
The other two options do not require a diagram. Once the data set is defined in the software, a diagram can be defined. The first step is to rightclick the Diagrams icon Display 1. Once you provide the name, the window on the right-hand side of the screen becomes white.
To explore the data using StatExplore, two icons are moved into the diagram window Display 1. Next, move the StatExplore icon, found on the Explore tab, to the diagram. Using the left mouse button, move from one icon to the next, and the two will connect.
Right-click the StatExplore icon to access a menu to run the exploration Display 1. A green border surrounds the icon when it is running Display 1. An error has occurred if the border turns red. Once the processing has completed, you are prompted for the next step. Click OK. Right-click the icon to view the results Output 1. The results from the StatExplore node provide a summary of the variables in the data set. Included in the View menu is an option to create plots of the data.
These options are discussed in more detail later in this chapter. Clicking Explore yields the result in Output 1. Therefore, it is essential to discuss the sample data set in more detail. This data set was chosen so that most readers would understand the basic domain of the data. The data set contains the workload assignments for university faculty in one academic department for a three-year period. There are a total of 14 variables and 71 observations. There are a number of reasons to examine the data: to determine if employee resources are used efficiently and to determine whether there are longitudinal shifts in workload assignments that might impact overall productivity.
In terms of domain knowledge, it is important to understand why workloads are assigned in this fashion and why overall trends can be overlooked. Faculty members are responsible for publications, presentations, and grants at the end of the year.
They are also responsible for teaching courses. As salaried employees, they have no required hours to fulfill on a daily or weekly basis. The workloads, then, are negotiated on an individual basis between each faculty member and administrative officials. Attempts to standardize workload requirements have not been entirely successful.
Without standardization, trends are often missed because the data are not examined and summarized. The variable list for the data set is given in Table 1.

Year: Defined as an academic year, August to May.
Percent instruction: The percentage of a full-time workload devoted to instruction. Instruction includes teaching courses, supervising students in independent study and theses, and preparation of new courses.
Percent courses: The percentage of a full-time workload devoted solely to teaching courses.
Number of courses: Course percentages have been standardized. Higher percentages are also given for teaching large lecture courses.
Preparations: If a new course is being taught, time is allocated for the extra preparation required.
Percentage professional activity: The time allocated for research activities. Faculty members are expected to publish, submit grant applications, and present at professional conferences.
Percentage service: Faculty members are also required to serve on committees or in administrative assignments.
Rank: Full-time faculty members have three possible ranks: assistant, associate, and full. Assistants are usually not tenured; associate and full faculty members usually are.
Sabbatical: Every seven years, faculty members have the potential for a half-year at full pay or a full year at half pay, usually related to research activities.
Administration: The percentage of service specifically allocated to administrative activities.

Rank, Sabbatical, and Year are identified as class variables, containing nominal data. The remaining variables are identified as interval variables. These assignments are made automatically by the software.
To investigate these variables using the StatExplore node, examine the class variables Display 1. The number of responses per year is about the same. Most faculty members do not have sabbaticals. In addition, half-year at full pay is more popular than full-year at half pay. The bar charts from the output are interactive.
If there are too many levels, they cannot all be displayed at once. By moving the cursor, additional parts of the charts are displayed. Variables are displayed in Output 1.
Output 1. All interval variables are represented and numbered. The coefficient of variation is plotted on the y-axis. To read the bars, move the cursor over one of them, and specific information is displayed Output 1. Note that interval variables are displayed in order of the magnitude of the coefficient of variation.
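The ordering by coefficient of variation is easy to reproduce by hand. The sketch below uses invented sample values, not the actual workloads data, and standard Python rather than SAS.

```python
import statistics

# Invented sample values for three interval variables.
data = {
    "Percent instruction": [40, 50, 45, 60, 35],
    "Number of courses":   [4, 4, 3, 4, 2],
    "Percentage service":  [10, 30, 5, 50, 20],
}

# Coefficient of variation: standard deviation divided by the mean.
cv = {
    name: statistics.stdev(vals) / statistics.mean(vals)
    for name, vals in data.items()
}

# Display in descending order of CV, as in the StatExplore chart.
for name, value in sorted(cv.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.2f}")
```

Variables with large CV values vary a great deal relative to their typical size, which is why they stand out first in the chart.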
More plots are provided with the MultiPlot node, discussed later in this section. Other information provided in the StatExplore node includes summary statistics, means for interval variables, and frequency counts for class variables Output 1. There are 59 values of None for the variable Sabbatical compared to 3 values for the year and 9 values for a semester, either fall or spring.
In addition, consider the frequency for faculty rank. The total number of faculty members appears larger than the actual head count; because the data cover three years, it is clear that there are duplicates.
For this reason, the percentages are more reliable. The percentage is given along with the frequency. Similarly, the mean number of courses is equal to 3.
Note that the Faculty variable is listed as an interval variable, with a mean and a standard deviation. However, numbers were used to mask faculty identity. Therefore, the level of that variable should have been changed to nominal and was erroneously left as interval.
The level of the variable can be changed by returning to the data set node and selecting Edit Variables Display 1. Only the first 10 faculty members are shown in the chart. By moving the cursor to the right of 10, the remaining faculty members are shown Output 1. You will use the MultiPlot node next to examine the variables in the data set Display 1. In particular, there are options for the types of charts that can be displayed Display 1. The default is the bar chart.
Introduction to Data Mining Using SAS Enterprise Miner
To examine the workloads data, bar charts were used. There are a total of 14 charts Output 1. Those who do have administrative responsibilities show considerable variability in the assigned workload.
Even though the variable, Faculty, has been changed to Nominal, it is still represented in a bar chart. The chart for faculty members provides no meaningful information at this point. Of those teaching large lectures, almost twice as many teach two large lecture courses (14) as teach one (8).
For the number of courses taught, most faculty members teach four courses in a year two per semester. However, the graph also shows that many teach reduced loads of less than four.
Only two faculty members teach more than the standard load of four courses. This allocation is reasonable because few courses taught in any semester are new. However, some faculty members receive time for new course preparation if they have never taught the course before. The Rank variable demonstrates what the frequency summary statistic already indicated: there are far fewer assistant rank faculty members compared to those of associate or full rank.
Faculty members in this department are clearly aging. The chart clearly indicates that most faculty members are not taking a sabbatical. Most faculty members are not working with any students on independent study or theses. However, a small number of faculty members are very active in working with students.
The workload for supervision is decidedly skewed. Chart 10 indicates that many faculty members teach at least one calculus course during an academic year. Slightly more teach two calculus courses during the year. The shape of the distributions differs somewhat. Therefore, some faculty members have additional instructional responsibilities that are not related to teaching courses compared to other faculty members.
Faculty members generally have more time allocated for research activities than for service activities. Why is there so much variability in workload assignments, especially for faculty members of similar ranks?
Other areas to investigate include whether there is a shift over the three-year time frame or if some of the variables are related to each other. These questions are explored in more detail in later chapters. It is up to you to use the tools and to interpret the findings. The more you are aware of domain knowledge, the more information you can extract from the data. It is particularly difficult, if not impossible, to attempt to interpret the results of a data mining investigation without any domain knowledge.
Without any attempt to verify the pattern of 5 heads followed by 5 tails, it is possible although not valid to conclude that every time a coin is flipped 10 times, the same pattern will occur. It is a conclusion that is easily contradicted if the coin is flipped several more times. Although this example seems somewhat absurd, other patterns seem to be accepted by many people with even less justification.
Pattern recognition without validation is a reason that data mining as a method was often disparaged in statistical training. Therefore, it is strongly recommended that you partition the data routinely into three data sets: training, validation, and testing. The validation data set iteratively ensures that the developed model fits a fresh data set.
Once the model is completed, the testing data set makes a final comparison. Because the division of a data set is so crucial to validation of the process, the software is set up so that splitting the data set into three components is almost automatic.
For a given target value, the accuracy of the final model is initially judged by the misclassification rate, where misclassification occurs when the predicted target value is not equal to the actual target value. There are additional diagnostics in the software that are discussed later. Another difference between traditional statistics and data mining is that there are often many different models that can be used to investigate the data.
Instead of choosing just one model to define a specific p-value, many different models are used and compared. Assessment methods have been developed to make these comparisons using the training, validation, and testing methodology.
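The misclassification rate introduced above can be sketched in a few lines. This is a minimal illustration using invented labels, not software output.

```python
# Misclassification rate: the fraction of observations whose predicted
# target value differs from the actual target value.
def misclassification_rate(actual, predicted):
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

actual    = ["churn", "stay", "stay", "churn", "stay"]
predicted = ["churn", "stay", "churn", "churn", "stay"]
print(misclassification_rate(actual, predicted))  # 0.2
```

Competing models can then be compared by computing this rate on the validation and testing partitions rather than on the data used to fit them.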
Scoring relates the predicted value to the actual value, and the closeness of one to the other can be examined using other statistical techniques.
Scoring is particularly important when examining the likelihood of a customer making a particular purchase and the amount of the purchase. In this case, scoring assigns a level of importance to a customer. For wireless phone service, churn (meaning that a customer decides to switch phone providers) is important; the provider wants to predict in advance those who are likely to churn in order to give those customers incentives to stay.
How can a business predict the likelihood of churn to offer incentives to prevent it? Scoring provides a means of making such predictions.
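A hedged sketch of how such scoring works follows. The coefficients and customer fields below are invented for illustration; they do not come from any fitted model, and this is Python rather than SAS score code.

```python
import math

# Invented coefficients standing in for a fitted logistic churn model.
coef = {"intercept": -2.0, "dropped_calls": 0.4, "months_tenure": -0.05}

def churn_score(record):
    """Predicted churn probability via the logistic link."""
    z = coef["intercept"]
    z += coef["dropped_calls"] * record["dropped_calls"]
    z += coef["months_tenure"] * record["months_tenure"]
    return 1.0 / (1.0 + math.exp(-z))

customers = [
    {"id": 1, "dropped_calls": 9, "months_tenure": 3},
    {"id": 2, "dropped_calls": 0, "months_tenure": 48},
]

# Rank by descending propensity; the provider targets incentives
# at the customers near the top of the list.
ranked = sorted(customers, key=churn_score, reverse=True)
print([c["id"] for c in ranked])
```

Customer 1, with many dropped calls and short tenure, scores far above customer 2, so the incentive budget goes there first.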
A number of icons on the Sample and Modify tabs are useful in investigating data. The Partition icon Display 1. This icon divides the data into three sets: train, validate, and test.
For many of the models in the software, the training data set initially defines the model; the validation data set tests the model in an iterative process with the training set.
The testing data set examines the overall accuracy of the model once it is complete. These defaults can be modified by moving the cursor and changing the values. Unless there is a good reason to change them, the defaults should be used to partition the data.
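The three-way split can be sketched as follows. This is an illustrative Python function, not the Data Partition node itself; the 40/30/30 fractions are shown as commonly cited defaults and should be checked against your installation.

```python
import random

# Assumed default allocation: 40% train, 30% validate, 30% test.
def partition(rows, train=0.4, validate=0.3, seed=12345):
    rng = random.Random(seed)
    shuffled = rows[:]          # leave the original data untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_valid = int(n * validate)
    return (
        shuffled[:n_train],                   # fit the model
        shuffled[n_train:n_train + n_valid],  # tune iteratively
        shuffled[n_train + n_valid:],         # final, one-shot check
    )

rows = list(range(100))
train_set, valid_set, test_set = partition(rows)
print(len(train_set), len(valid_set), len(test_set))  # 40 30 30
```

Every observation lands in exactly one partition, which is what makes the validation and testing comparisons honest.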
The Drop icon Display 1. Specific variables might be used or dropped as models are developed to predict outcomes. While the initial list of variables might have 100 inputs, you might need to focus on just 10 for the model.
In that case, the remaining 90 can be dropped for the remaining part of the analysis. Right-clicking any variable allows Default to change to Yes or No depending on whether you want to drop or keep the variable.
The default indicates the assignment from the initial data set. Text mining tools use singular value decomposition and text parsing in combination with classification techniques to extract information.
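The effect of the Drop node can be sketched in plain Python; the records and variable names below are invented for illustration.

```python
# Remove named variables from each record, keeping the rest intact,
# as the Drop node does for downstream analysis.
def drop_variables(records, to_drop):
    drop = set(to_drop)
    return [{k: v for k, v in rec.items() if k not in drop}
            for rec in records]

records = [
    {"Faculty": 1, "Rank": "full", "Comments": "n/a", "Courses": 4},
    {"Faculty": 2, "Rank": "assistant", "Comments": "", "Courses": 3},
]
cleaned = drop_variables(records, ["Comments"])
print(sorted(cleaned[0]))  # ['Courses', 'Faculty', 'Rank']
```

Dropping rather than deleting from the source data means the full variable list remains available if a later model needs it.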
Poor quality data can cause more immediate harm and have other more indirect effects, according to Lambert.
The emphasis on these exploratory or pattern recognition techniques in data mining is a helpful addition to the literature illustrating the application of SAS Enterprise Miner.