The Logi Predict add-on module for Logi Info allows users to analyze historical or transactional data and make statistical predictions about current data. In the accompanying Logi Predict application, the initial steps include creating and training a predictive model.
The Classification model is used to predict categorical values and is useful for answering questions that have binary (Yes/No) answers, such as "Is this customer likely to default on a loan?" Other similar examples include:
- How likely is this customer to churn?
- Will a user click on this link?
- Will a user buy this product?
- Is this a fraudulent transaction?
- Is this employee at risk of leaving?
- Is this a low- or high-risk claim?
Algorithms such as Generalized Linear Model for Two Values and Random Forest are used in Classification models to analyze historical data in order to identify patterns and indicators that can then be applied to new data.
The goal is to use the insights from this model's predictions to engage more effectively with your users or customers.
For example, consider a scenario wherein we want to identify customers in danger of "churning" or turning over, which we'll define as those who bought something at least twice prior to 2016 but have bought nothing since then. Logi Predict will use historical transaction data, such as the last order date, the number of recent orders, and the amount spent, to train a classification model for this purpose.
That model will then be used with current transaction data to predict which customers are at risk of churning. The predictions can then be integrated with operational systems to guide workflow during future customer interactions and help prevent churn.
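To make the churn definition in this example concrete, here is a minimal Python sketch that labels customers as churned ("Yes") or not ("No") from an order history. It is only an illustration of the definition given above, not part of Logi Predict; the `ORDERS` data, the `label_churn` function, and the `CUTOFF` constant are hypothetical names invented for this sketch.

```python
from datetime import date

# Hypothetical order history as (customer_id, order_date) pairs.
ORDERS = [
    ("A", date(2014, 3, 1)), ("A", date(2015, 7, 9)),
    ("B", date(2015, 1, 5)), ("B", date(2015, 2, 6)), ("B", date(2016, 4, 2)),
    ("C", date(2015, 11, 30)),
]

# Orders before this date count as "prior to 2016".
CUTOFF = date(2016, 1, 1)


def label_churn(orders, cutoff=CUTOFF):
    """Label each customer 'Yes' (churned) or 'No'.

    Churned means at least two orders before the cutoff
    and no orders on or after it.
    """
    before, since = {}, {}
    for customer, order_date in orders:
        bucket = before if order_date < cutoff else since
        bucket[customer] = bucket.get(customer, 0) + 1
    customers = set(before) | set(since)
    return {
        c: "Yes" if before.get(c, 0) >= 2 and since.get(c, 0) == 0 else "No"
        for c in customers
    }


labels = label_churn(ORDERS)
```

A Yes/No column like this, stored with the historical data, is the kind of column you would choose as the Predictable Column when training the model.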
This topic assumes you've selected New Model... in the Logi Predict application and have been shown this panel:
Select the Classification Model and click Create, as shown above.
When the new model page is displayed, you can give the model a name by clicking the gear icon, as shown above. Note that you can also duplicate a model, in order to use an existing model as a template for a new one. Use the following steps to create and train the model:
Step 1. Select Historical Data - Select the historical data to be analyzed for predictive patterns and indicators. This should not be the new or current data from which you want to generate predictions. If you have multiple data sources to choose from, additional selection controls will be displayed. Select the columns to be used, apply filters or formula columns as desired, and click OK. The selected data will then appear in a 10-row table.
You can expand and collapse the data selection controls using the icons.
Step 2. Select Training Options - Use these controls to configure the training parameters. The options, which will vary based on the algorithm selected, are:
- Predictable Column - Select the column from the database that contains the historical values that will be predicted in the new data. Once a column is selected, the Determine Column Importance link will appear. Click it to view the Column Importance panel, and click Calculate in it to begin a column importance analysis:
This analysis, based on a statistical sample of rows, identifies which columns in the historical data will have the most influence on the prediction. Models work best when unimportant columns are not used to train them. The analysis will automatically remove highly correlated columns from consideration.
You can experiment by selecting and de-selecting columns here. Click OK, Save Column Selections to update the columns selected in Step 1, then click Recalculate to get a fresh analysis. Repeat the process, if desired. For the best results, we recommend that you use only the 5-10 most influential columns. Click X to close the panel.
- Algorithm - Select a statistical algorithm to use when training the model. Algorithms are discussed in detail in About Logi Predict. You may want to make multiple training runs with different algorithms to determine which one gives the best accuracy. The available algorithms include:
- Generalized Linear Model for Two Values - Provides faster training, but can be less accurate
- Gradient Boosted Model - Provides fast, ensemble training using cumulative classifiers and is very accurate
- Random Forest - Provides fast, ensemble training using decision trees and is very accurate
- Number of Trees - The Random Forest algorithm uses multiple decision trees and aggregates their results. Higher values (more decision trees) use the data more completely during training, which may increase accuracy but also takes more time to run.
- Optimize for Imbalanced Data - This option is useful when a majority of the rows have the same value in influential columns.
- Smart Data Cleanup - This feature helps handle "dirty" data (data with nulls and other imperfections), improving accuracy. See the Smart Data Cleanup discussion later in this topic for more information about this process.
- Data Volume - This selection allows you to make "test" training runs with subsets of your data and to tweak your model for the best accuracy before committing to running the training with all of the historical data. In some cases, accuracy may fluctuate when more rows are used, so training with different data volumes is useful.
Click Train Model to start the training process.
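The Number of Trees option above refers to aggregating the results of multiple decision trees. The following Python sketch illustrates only that aggregation step, using a majority vote, which is one common way to combine tree predictions; it is an assumption for illustration and not a description of how Logi Predict implements Random Forest. The `make_stump` and `forest_predict` functions, the column names, and the thresholds are all hypothetical, and each "tree" is a one-rule stand-in rather than a trained decision tree.

```python
from collections import Counter


def make_stump(column, threshold, yes_when_above=True):
    """Hypothetical stand-in for a trained tree: one threshold test on one column."""
    def predict(row):
        above = row[column] > threshold
        return "Yes" if above == yes_when_above else "No"
    return predict


def forest_predict(trees, row):
    """Aggregate the individual tree predictions by majority vote."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Three stand-in "trees". A real forest would learn many trees from the data.
trees = [
    make_stump("days_since_last_order", 365),            # votes Yes if > 365
    make_stump("days_since_last_order", 500),            # votes Yes if > 500
    make_stump("recent_orders", 0, yes_when_above=False),  # votes Yes if no recent orders
]

row = {"days_since_last_order": 420, "recent_orders": 0}
prediction = forest_predict(trees, row)  # two of the three trees vote "Yes"
```

With more trees, a single tree's mistake is more easily outvoted, which is why a higher tree count may improve accuracy at the cost of training time.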
Step 3. Review and Select Training Results - When model training begins, an entry is added to a results table at the bottom of the page. You can abort the training run, if necessary, by clicking the Abort Training... button shown above.
When training finishes, the table entry will include data similar to that shown above. Column contents include:
- Status - The current status of the training run. As shown earlier, during the run an Abort Training... link is displayed here. Any errors that may occur during training will be described in a message here as well.
- Selected Training - There may be several entries in this table, describing several training runs for this model. A green checkmark here indicates that this training has been selected for use when this model is used in a prediction plan. Unselected entries will only have a Go Live link in this column. Click New Plan to create a new prediction plan using this training/model.
Click the icon to open the API wizard. For more information, see Logi Predict API.
- Algorithm - The algorithm-related training options used for this training run.
- Training Columns - The number of columns selected for use during training. The number is also a link; click it to see information about how the data was used:
In the example above, you can see which columns were used for the training, and also which columns were affected by the Smart Data Cleanup feature. In the example, three columns were ignored during training because more than 40% of their values were blank. Click X to close the panel.
- Training Rows - The number of rows of data used in this training.
- Test Predictions Distribution - An indication of the number of categorizations (2) used and a visualization of the distribution of the two values in the data.
Hover your mouse cursor over the chart, as shown above, to see the actual distribution count.
- Test Accuracy % - The accuracy of this model after training with these options. Click the View Details link to see information about this accuracy score:
These statistics indicate the quality of the model. If the accuracy is low, try adjusting the training options and retraining the model. Click X to close the panel.
- Time - The timestamp and duration of the training.
- Actions - The Refresh icon sets the controls for data selections and training options on this page to those used when this training was run, so that you can see the full details or make adjustments and rerun the training. Click the Trashcan icon to delete the entry from the table. This icon only appears in table entries not selected for use with a prediction plan; it is shown here for illustration purposes.
You can click many of the column headers to sort the table on that column.
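As a reference point for the Test Accuracy % column described above, the following Python sketch computes accuracy in one common way: the percentage of test rows whose predicted value matches the actual value, along with the four counts that usually accompany it. This is an assumption about what such a score typically means, not a statement of how Logi Predict calculates it. The function names and sample values are hypothetical.

```python
def accuracy_percent(actual, predicted):
    """Percentage of rows where the prediction matches the actual value."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a two-value prediction."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical actual and predicted values for eight test rows.
actual = ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "No", "No", "Yes", "No", "Yes"]

score = accuracy_percent(actual, predicted)
counts = confusion_counts(actual, predicted)
```

When one value dominates the data, a high accuracy score can hide poor predictions for the rarer value, so the individual counts are worth checking alongside the percentage.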
Once you've created and trained an accurate model, you're ready to use it in a prediction plan, as discussed in Logi Predict Setup and Use.
Production data, as we all know, is rarely in pristine condition. This can affect the accuracy of model training, so Logi Predict includes a Smart Data Cleanup feature that attempts to compensate for various data situations.
Smart Data Cleanup does not update or rewrite the actual data in your data source. It applies its compensations to values in memory, after the data is read and before it is used for training.
Smart Data Cleanup works by applying a set of data management rules, which are shown below. Note that the thresholds and quantities shown are defaults that may be changed by your system admin or application developer.
- Ignore highly correlated columns (columns with values that match 80% of another column's values)
- Ignore columns whose values are all the same (zero variance)
- Ignore duplicated rows
- Ignore a Numeric column with missing data values ("NA") in 40% or more of its rows
- Replace NAs with the column's mean value in Numeric columns that have NAs in less than 40% of their rows
- Replace empty strings ("") in Categorical columns with NAs
- Replace Categorical column NAs with the "UnknownNAs" string value as a new category
- Replace spaces or invalid characters in Categorical column values with dots or underscores (as the R make.names function does)
- Ignore Categorical columns with more than six levels
Unless your data has been specifically groomed in advance for analysis, we recommend that you leave this feature enabled in the training options.
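To show how several of the rules listed above might behave on a small data set, here is a minimal Python sketch. It is not Logi Predict's implementation: the `clean` function, the sample rows, and the order in which the rules run are assumptions made for illustration, and the correlated-column and character-replacement rules are left out for brevity. `None` stands in for NA, and the thresholds are the defaults quoted above.

```python
NA_THRESHOLD = 0.40  # default from the rules above
MAX_LEVELS = 6       # default from the rules above


def clean(rows, numeric_cols, categorical_cols):
    """Apply a subset of the cleanup rules to an in-memory copy of the rows.

    rows is a list of dicts; None marks a missing value (NA).
    Returns (cleaned_rows, kept_column_names). The input is never modified.
    """
    if not rows:
        return [], []
    columns = list(numeric_cols) + list(categorical_cols)

    # Ignore duplicated rows, keeping the first occurrence of each.
    seen, data = set(), []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            data.append(dict(row))  # copy, so the source rows stay untouched

    kept = []
    for col in columns:
        values = [r[col] for r in data]
        if col in categorical_cols:
            # "" becomes NA, and NA becomes the "UnknownNAs" category.
            values = ["UnknownNAs" if v in ("", None) else v for v in values]
            if len(set(values)) > MAX_LEVELS:
                continue  # too many levels: ignore the column
        else:
            present = [v for v in values if v is not None]
            if (len(values) - len(present)) / len(values) >= NA_THRESHOLD:
                continue  # too many NAs: ignore the column
            mean = sum(present) / len(present)
            values = [mean if v is None else v for v in values]
        if len(set(values)) <= 1:
            continue  # zero variance: ignore the column
        for r, v in zip(data, values):
            r[col] = v
        kept.append(col)

    return [{c: r[c] for c in kept} for r in data], kept


# Hypothetical sample data.
rows = [
    {"amount": 10.0, "score": None, "visits": 3, "region": "East", "flag": "x"},
    {"amount": None, "score": None, "visits": 3, "region": "", "flag": "x"},
    {"amount": 30.0, "score": 5.0, "visits": 3, "region": "West", "flag": "x"},
    {"amount": 10.0, "score": None, "visits": 3, "region": "East", "flag": "x"},
    {"amount": 20.0, "score": None, "visits": 3, "region": "East", "flag": "x"},
]
cleaned, kept = clean(rows, ["amount", "score", "visits"], ["region", "flag"])
```

In this sample, the duplicated fourth row is dropped, `score` is ignored because most of its values are missing, `visits` and `flag` are ignored because their values never vary, the missing `amount` is filled with the mean of the others, and the blank `region` becomes its own category. The original `rows` list is left unchanged, mirroring the in-memory behavior described above.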