The Logi Predict add-on module for Logi Info allows users to analyze historical or transactional data and make statistical predictions about current data. In the accompanying Logi Predict application, the initial steps include creating and training a predictive model.
The Outliers model is used to identify values in the data that fall outside the range of "what's expected". That's a subjective judgment, of course. Some define outliers as values that are far away from the median, but how far is "far away"? Others define them as values that lie more than some multiple of the standard deviation from the mean, or as values that fall beyond a range based on the interquartile range. Those are simple ways to define outliers for a single variable.
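As an illustration of the two single-variable definitions just mentioned, here is a minimal Python sketch. It is not Logi Predict's implementation; the function names, the sample data, and the cutoff multipliers (2 standard deviations, 1.5 times the interquartile range) are assumptions chosen for the example.

```python
import statistics

def stddev_outliers(values, k=2.0):
    """Flag values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical application response times in milliseconds.
response_ms = [120, 125, 118, 130, 122, 127, 119, 124, 121, 480]

print(stddev_outliers(response_ms))  # -> [480]
print(iqr_outliers(response_ms))     # -> [480]
```

Both rules agree here, but they need not: the standard deviation is itself inflated by extreme values, so the multiplier that counts as "far away" remains a judgment call.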
While most people understand single-variable outliers, outliers can also exist across multiple variables: a row may look unremarkable in each column on its own yet be unusual as a combination. This is more common in complex data. Logi Predict can handle either data scenario.
Outliers can also be hierarchical. If your data is hierarchical, outliers for one category could be different than they are for another category.
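A small Python sketch can make the hierarchical case concrete. It is an illustration only, not the product's algorithm: the category names, the call-volume figures, and the use of an interquartile-range rule within each category are all assumptions.

```python
from collections import defaultdict
import statistics

def outliers_by_category(rows, k=1.5):
    """Apply an IQR rule separately within each category.

    rows is a list of (category, value) pairs.
    """
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)

    flagged = []
    for category, values in groups.items():
        if len(values) < 4:
            continue  # too few values to estimate quartiles sensibly
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        flagged.extend((category, v) for v in values if v < low or v > high)
    return flagged

# Hypothetical daily call volumes for two departments.
support = [190, 195, 198, 200, 202, 205, 210, 60]
billing = [50, 52, 55, 57, 58, 60, 62, 200]
rows = [("Support", v) for v in support] + [("Billing", v) for v in billing]

per_category = outliers_by_category(rows)
print(per_category)  # -> [('Support', 60), ('Billing', 200)]

# Pooling everything into one group hides both values.
pooled = outliers_by_category([("All", v) for _, v in rows])
print(pooled)  # -> []
```

A volume of 60 is ordinary for Billing but unusual for Support, and the reverse holds for 200, which is why evaluating each category separately matters.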
Similar to Clustering models, an Outliers model works best with continuous variables (i.e. numeric data).
Here are some examples of the kinds of questions that can be answered with this model:
- Is this a fraudulent claim?
- Do sensor measurements indicate an anomaly?
- Are call volumes out of the ordinary?
- Has application response time changed?
Algorithms such as All Together and Column by Column are used in Outliers models to analyze historical data in order to identify patterns and indicators that can then be applied to new data.
This topic assumes you've selected New Model... in the Logi Predict application and have been shown this panel:
Select the Outliers Model and click Create, as shown above.
When the new model page is displayed, you can give the model a name by clicking the gear icon, as shown above.
You can also duplicate a model, in order to use an existing model as a template for a new one.
Use the following steps to create and train the model:
Step 1. Select Historical Data - Select the historical data source and data type to be analyzed for predictive patterns and indicators. This should not be the new or current data from which you want to generate predictions. If you have multiple data sources to choose from, additional selection controls will be displayed. Select the columns to be used, apply filters or formula columns as desired, and click OK. The selected data will then appear in a 10-row table.
You can expand and collapse the data selection controls using the icons.
Step 2. Select Training Options - Use these controls to configure the training parameters:
- Algorithm - Select a statistical algorithm to use when training the model. Algorithms are discussed in detail in About Logi Predict. You may want to make multiple training runs with different algorithms to determine the best accuracy. The available algorithms include:
  - All Together - Evaluates numeric columns in context with each other.
  - Column by Column - Evaluates numeric columns individually, identifying outliers for each column value.
  Regardless of which algorithm is chosen, each row's text values are combined and numeric values are then evaluated within each unique combination.
- Smart Data Cleanup - This feature helps handle "dirty" data (data with nulls and other imperfections), improving accuracy. For more information about this process, see the Smart Data Cleanup discussion later in this topic.
- Data Volume - This selection allows you to make "test" training runs with subsets of your data and to tweak your model for the best accuracy before committing to running the training with all of the historical data. In some cases, accuracy may fluctuate when more rows are used, so training with different data volumes is useful.
Click Train Model to start the training process.
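The difference between evaluating columns individually and evaluating them in context with each other can be sketched in Python. This topic does not describe how Logi Predict implements its algorithms, so the example below is only an analogy: it uses a per-column standard-deviation rule for the first case and a two-column Mahalanobis distance for the second. The function names, thresholds, and synthetic data are all assumptions.

```python
import statistics

def per_column_outliers(rows, k=2.0):
    """Indices of rows where any single column is more than k std devs from its mean."""
    columns = list(zip(*rows))
    stats = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [i for i, row in enumerate(rows)
            if any(abs(v - m) > k * s for v, (m, s) in zip(row, stats))]

def joint_outliers(rows, threshold=2.0):
    """Indices of two-column rows whose Mahalanobis distance exceeds threshold."""
    xs, ys = zip(*rows)
    n = len(rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    # Sample variances and covariance of the two columns.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("columns are perfectly correlated; distance is undefined")
    flagged = []
    for i, (x, y) in enumerate(rows):
        dx, dy = x - mx, y - my
        # Squared Mahalanobis distance via the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold ** 2:
            flagged.append(i)
    return flagged

# Synthetic data: the second column tracks the first (y = 10 * x),
# except for the last row, which breaks the pattern.
rows = [(x, 10 * x) for x in range(1, 11)] + [(3, 80)]

print(per_column_outliers(rows))  # -> []
print(joint_outliers(rows))       # -> [10]
```

The last row's values, 3 and 80, are each well within the range of their own column, so the per-column check finds nothing. Only when the two columns are considered together does the row stand out.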
Step 3. Review and Select Training Results - When model training begins, an entry is added to a results table at the bottom of the page. You can abort the training run, if necessary, by clicking the button shown above.
When training finishes, the table entry will include data similar to that shown above. Column contents include:
- Status - The current status of the training run. As shown earlier, during the run an Abort Training... link is displayed here. Any errors that may occur during training will be described in a message here as well.
- Selected Live Training - There may be several entries in this table, describing several training runs for this model. A green checkmark here indicates that this training has been selected for use when this model is used in a prediction plan. Unselected entries will only have a Select link in this column. Click New Plan to create a new prediction plan using this training/model.
Click the icon to open the API wizard. For more information, see Logi Predict API.
- Algorithm - The algorithm-related training options used for this training run.
- Training Columns - The number of columns selected for use during training. The number is also a link to information that provides insight into how the data was used:
In the example above, you can see which columns were used for the training. You'll also see which, if any, columns were affected by the Smart Data Cleanup feature. Click X to close the panel.
- Training Rows - The number of rows of data used in this training.
- Outliers - A graphic indication of the number of outliers found.
- Time - The timestamp and duration of the training.
- Actions - The Refresh icon sets the controls for data selections and training options on this page to those used when this training was run, so that you can see the full details and/or make adjustments and rerun the training. Click the Trashcan icon, which appears in table entries not selected for use with a prediction plan, to delete the entry from the table.
You can click many of the column headers to sort the table on that column.
Once you've created and trained an accurate model, you're ready to use it in a prediction plan, as discussed in Logi Predict Setup and Use.
Production data, as we all know, is rarely in pristine condition. This can affect the accuracy of model training, so Logi Predict includes a Smart Data Cleanup feature that attempts to compensate for various data situations.
Smart Data Cleanup does not update or rewrite the actual data in your data source - it applies its compensations to values in memory after reading the data, before they're used for training.
Smart Data Cleanup works by applying a set of data management rules, which are shown below. The thresholds and quantities shown are defaults that may be changed by your system admin or application developer.
- Ignore duplicated rows
- Ignore a Numeric column with missing data values ("NA") in 40% or more of its rows
- Replace NAs with the column mean in Numeric columns that have NAs in fewer than 40% of their rows
- Replace empty strings ("") in Categorical columns with NAs
- Replace NAs in Categorical columns with the "UnknownNAs" string value as a new category
- Replace spaces or invalid characters in Categorical column values with dots or underscores (as the R make.names function does)
Unless your data has been specifically groomed in advance for analysis, we recommend that you leave this feature enabled in the training options.
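The cleanup rules listed above can be sketched in Python as follows. This is a rough approximation for illustration, not the product's code: the function name `smart_cleanup`, the row representation (dictionaries with `None` standing in for NA), and the character-replacement pattern are assumptions, and the real R make.names function applies additional rules that are not reproduced here.

```python
import re
import statistics

def smart_cleanup(rows, numeric_cols, categorical_cols, na_threshold=0.40):
    """Return cleaned copies of rows; the input rows are left unchanged."""
    cols = list(numeric_cols) + list(categorical_cols)

    # Ignore duplicated rows (keep the first occurrence).
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row.get(c) for c in cols)
        if key not in seen:
            seen.add(key)
            cleaned.append({c: row.get(c) for c in cols})
    if not cleaned:
        return []

    for col in numeric_cols:
        present = [r[col] for r in cleaned if r[col] is not None]
        missing_share = (len(cleaned) - len(present)) / len(cleaned)
        if missing_share >= na_threshold:
            # Ignore a numeric column with too many missing values.
            for r in cleaned:
                del r[col]
        else:
            # Otherwise fill the gaps with the column mean.
            mean = statistics.mean(present)
            for r in cleaned:
                if r[col] is None:
                    r[col] = mean

    for col in categorical_cols:
        for r in cleaned:
            value = r[col]
            if value is None or value == "":
                # Empty strings become NA, and NA becomes its own category.
                r[col] = "UnknownNAs"
            else:
                # Approximation: replace anything outside letters, digits,
                # dots, and underscores with a dot.
                r[col] = re.sub(r"[^A-Za-z0-9._]", ".", value)
    return cleaned

# Hypothetical input data.
raw = [
    {"region": "North East", "amount": 10.0, "score": None},
    {"region": "",           "amount": None, "score": None},
    {"region": "South",      "amount": 20.0, "score": 5.0},
    {"region": "South",      "amount": 20.0, "score": 5.0},  # duplicate row
    {"region": None,         "amount": 30.0, "score": None},
]

result = smart_cleanup(raw, numeric_cols=["amount", "score"], categorical_cols=["region"])
for row in result:
    print(row)
```

In this example the duplicate row is dropped, the `score` column is ignored because 75% of its remaining values are missing, the missing `amount` is filled with the mean of 20.0, and the original `raw` list is not modified, mirroring the in-memory behavior described earlier.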