The Logi Predict add-on module for Logi Info allows users to analyze historical or transactional data and make statistical predictions about current data. In the accompanying Logi Predict application, the initial steps include creating and training a predictive model.
The Forecast model is used to predict the possible value of a metric. Use this model for questions with numeric answers, such as "How many rail passengers will we carry on weekends?" Other example questions include:
- What will Call Center case volume be tomorrow, next week, or next month?
- How long will it take to perform this job?
- How many customers will come to my shop?
- What will the response time be for this issue?
- What will the emergency wait time be?
- How many watches should I keep in inventory?
- How many customers will default?
Algorithms such as Linear Regression and Random Forest are used in Forecast models to analyze historical data in order to statistically identify patterns and indicators that can then be applied to new data.
This topic assumes you've selected New Model... in the Logi Predict application and have been shown this panel:
Select the Forecast Model and click Create, as shown above.
When the new model page is displayed, you can give the model a name by clicking the gear icon, as shown above. Note that you can also duplicate a model, in order to use an existing model as a template for a new one. Use the following steps to create and train the model:
Step 1. Select Historical Data - Select the historical data to be analyzed for predictive patterns and indicators. This should not be the new or current data from which you want to generate predictions. If you have multiple data sources to choose from, additional selection controls will be displayed. Select the columns to be used, apply filters or formula columns as desired, and click OK. The selected data then appears in a 10-row table.
You can expand and collapse the data selection controls using the icons.
Step 2. Select Training Options - Use these controls to configure the training parameters. The options, which will vary based on the algorithm selected, are:
- Predictable Column - Select the column in the historical data that contains the values to be predicted for the new data. Once a column is selected, the Determine Column Importance link appears. Click it to open the Column Importance panel, then click Calculate to begin a column importance analysis.
This analysis, based on a statistical sample of rows, identifies which columns in the historical data will have the most influence on the prediction. Models work best when unimportant columns are not used to train them. The analysis will automatically remove highly-correlated columns from consideration.
You can experiment by selecting or de-selecting columns here. Then click OK, Save Column Selections to modify the data columns selected in Step 1, and click Recalculate to get a fresh analysis. For the best results, we recommend using only the 5-10 most influential columns. Click X to close the panel.
- Algorithm - Select a statistical algorithm to use when training the model. Algorithms are discussed in detail in About Logi Predict. You may want to make multiple training runs with different algorithms to determine the best accuracy. The available algorithms include:
- Gradient Boosted Model - Provides fast, ensemble training and is very accurate
- Linear Regression - Provides faster training, but can be less accurate
- Random Forest - Provides fast, ensemble training using decision trees and is very accurate
- Number of Trees - (if the Random Forest algorithm is selected) The Random Forest algorithm uses multiple decision trees and aggregates their results. Higher values (more decision trees) use the training data more completely and may increase accuracy, but training takes longer to run.
- Optimize for Imbalanced Data - (if the Random Forest algorithm is selected) This option is useful when a majority of the rows have the same value in influential columns.
- Smart Data Cleanup - This feature helps handle "dirty" data (data with nulls and other imperfections), improving accuracy. See the Smart Data Cleanup discussion later in this topic for more information about this process.
- Data Volume - This selection allows you to make "test" training runs with subsets of your data and to tweak your model for the best accuracy before committing to running the training with all of the historical data. In some cases, accuracy may fluctuate when more rows are used, so training with different data volumes is useful.
Click Train Model to start the training process.
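The column importance idea described in Step 2 can be illustrated with a minimal Python sketch. This is not Logi Predict's actual analysis, whose method is not described in this topic. As a stand-in, the sketch ranks numeric columns by the absolute Pearson correlation between each column and the predictable column. The column names, the sample values, and the `pearson` and `rank_columns` helpers are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_columns(columns, target, top_n=5):
    """Return up to top_n (name, score) pairs, strongest absolute correlation first."""
    scores = [
        (name, abs(pearson(values, columns[target])))
        for name, values in columns.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

# Hypothetical historical data: one list of values per column.
history = {
    "day_of_week": [1, 2, 3, 4, 5, 6, 7],
    "temperature": [20, 22, 19, 24, 23, 21, 20],
    "promo_count": [0, 1, 0, 2, 1, 3, 3],
    "passengers": [100, 130, 105, 160, 135, 190, 195],
}

# "passengers" plays the role of the predictable column (an assumption).
ranking = rank_columns(history, target="passengers", top_n=2)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A simple correlation score only captures linear relationships, so treat it as a rough way to reason about which columns might matter, not as a substitute for the Column Importance panel.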
Step 3. Review and Select Training Results - When model training begins, an entry is added to a results table at the bottom of the page. If necessary, you can abort the training run using the Abort Training... link in the entry's Status column.
When training finishes, the table entry will include data similar to that shown above. Column contents include:
- Status - The current status of the training run. As shown earlier, during the run an Abort Training... link is displayed here. Any errors that may occur during training will be described in a message here as well.
- Selected Training - There may be several entries in this table, describing several training runs for this model. A green checkmark here indicates that this training has been selected for use when this model is used in a prediction plan. Unselected entries will only have a Go Live link in this column. Click New Plan to create a new prediction plan using this training/model.
Click the icon to open the API wizard. For more information, see Logi Predict API.
- Algorithm - The algorithm-related training options used for this training run.
- Training Columns - The number of columns selected for use during training. The number is also a link to information that provides insight into how the data was used:
In the example above, you can see which columns were used for the training. Any columns affected by Smart Data Cleanup will also be listed here, with an explanation. Click X to close the panel.
- Training Rows - The number of rows of data used in this training.
- Predictable Column Range - An indication of the range of values encountered in the predictable column during training.
As shown above, the numbers are the Minimum, Mean, and Maximum values.
- Margin of Error - The margin of error for this model after training with these options, expressed as the root-mean-square error (RMSE). The lower the margin of error, the more accurate the model.
- Time - The timestamp and duration of the training.
- Actions - The Refresh icon sets the controls for data selections and training options on this page to those used when this training was run, so that you can see the full details and/or make adjustments and rerun the training. The Trashcan icon deletes the entry from the table; it appears only in table entries that are not selected for use with a prediction plan.
You can click many of the column headers to sort the table on that column.
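To make the Margin of Error and Predictable Column Range columns more concrete, here is a minimal Python sketch of the standard root-mean-square error calculation, along with the minimum, mean, and maximum of a column. The sample values and the `rmse` and `column_range` helper names are hypothetical, and the sketch does not show how Logi Predict chooses the rows it evaluates.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def column_range(values):
    """Return (minimum, mean, maximum) for a numeric column."""
    return min(values), sum(values) / len(values), max(values)

# Hypothetical wait times (minutes) and a model's predictions for them.
actual = [30, 45, 60, 50]
predicted = [33, 41, 60, 55]

error = rmse(actual, predicted)
low, mean, high = column_range(actual)
print(f"RMSE: {error:.2f}  range: {low} / {mean} / {high}")
```

Because RMSE is expressed in the same units as the predictable column, it is easiest to judge against that column's range: an error of a few minutes means something different for values between 30 and 60 than for values between 0 and 5.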
Once you've created and trained an accurate model, you're ready to use it in a prediction plan, as discussed in Logi Predict Setup and Use.
Production data, as we all know, is rarely in pristine condition. This can affect the accuracy of model training, so Logi Predict includes a Smart Data Cleanup feature that attempts to compensate for various data situations.
Smart Data Cleanup does not update or rewrite the actual data in your data source - it applies its compensations to values in memory after reading the data, before they're used for training.
Smart Data Cleanup works by applying a set of data management rules, which are shown below. Note that the thresholds and quantities shown are defaults that may be changed by your system admin or application developer.
- Ignore highly-correlated columns (columns with values that match 80% of another column's values)
- Ignore columns whose values are all the same (zero variance)
- Ignore duplicated rows
- Ignore Numeric columns with missing data values ("NA") in 40% or more of their rows
- Replace the NAs with the Mean value in Numeric columns with NAs in fewer than 40% of their rows
- Replace empty strings ("") in Categorical columns with NAs
- Replace Categorical column NAs with the "UnknownNAs" string value as a new category
- Replace spaces or invalid characters in Categorical column values with dots or underscores (as done by the R make.names function)
- Ignore Categorical columns with more than six levels
Unless your data has been specifically groomed in advance for analysis, we recommend that you leave this feature enabled in the training options.
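Several of the rules listed above can be sketched in a few lines of Python. This is an illustration of the rules as stated in this topic, not Logi Predict's implementation. The helper names, the use of `None` to represent an NA, and the sample values are assumptions, and the correlation, character replacement, and level-count rules are left out. Like the feature itself, the sketch works on in-memory copies and never changes the source data.

```python
NA = None                # assumed stand-in for a missing value
NUMERIC_NA_LIMIT = 0.40  # default threshold quoted in the rules above

def clean_numeric(values):
    """Return None to ignore the column, else a copy with NAs replaced by the mean.

    Assumes a non-empty list of values.
    """
    missing = sum(1 for v in values if v is NA)
    if missing / len(values) >= NUMERIC_NA_LIMIT:
        return None
    present = [v for v in values if v is not NA]
    mean = sum(present) / len(present)
    return [mean if v is NA else v for v in values]

def clean_categorical(values):
    """Treat "" as NA, then map every NA to the "UnknownNAs" category."""
    return ["UnknownNAs" if v is NA or v == "" else v for v in values]

def has_zero_variance(values):
    """True when every value in the column is the same."""
    return len(set(values)) <= 1

def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical columns and rows.
wait_minutes = clean_numeric([10, NA, 30, 20])    # 25% missing: filled with the mean
sparse_column = clean_numeric([1, NA, NA, 4])     # 50% missing: ignored
colors = clean_categorical(["red", "", NA, "blue"])
rows = drop_duplicate_rows([[1, "a"], [1, "a"], [2, "b"]])
constant = has_zero_variance([5, 5, 5])
```

Keeping each rule in its own small function makes it easy to see what each threshold does and to try different values, which mirrors the note above that the defaults may be changed by your system admin or application developer.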