The Logi Predict add-on module for Logi Info allows users to analyze historical or transactional data and make statistical predictions about current data. In the accompanying Logi Predict application, the initial steps include creating and training a predictive model.
The Clustering model lets you gather data points into smart groups or segments based on their attributes, such as grouping customers into smart "buckets" based on buying patterns and demographics. Other examples include:
- Grouping loans into smart buckets based on loan attributes.
- Grouping SaaS customer data into groups to understand global patterns.
- Grouping insurance policy holders.
The popular K-Means algorithm is used to analyze historical data in order to identify patterns and indicators that can then be applied to new data.
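As a rough illustration of the idea, the sketch below trains a standard batch K-Means (Lloyd's algorithm) on a few made-up historical rows and then assigns new rows to the nearest learned cluster center. It is not Logi Predict's own implementation; the function names, the sample data, and the fixed iteration count are all assumptions made for the example.

```python
import math
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def train_kmeans(rows, k, iterations=20, seed=0):
    """Standard batch K-Means (Lloyd's algorithm) over a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # start from k rows picked from the historical data
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for row in rows:
            groups[nearest(row, centroids)].append(row)
        # Move each centroid to the mean of its members; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


# Hypothetical historical data: (annual spend, visits per month).
historical = [(100, 1), (120, 2), (110, 1), (900, 9), (950, 10), (870, 8)]
centroids = train_kmeans(historical, k=2)

# New rows are assigned to whichever learned centroid is closest.
low_spender = nearest((105, 1), centroids)
high_spender = nearest((920, 9), centroids)
```

The two new rows land in different clusters because each sits close to a different group of historical rows. Real data would normally be scaled first so that one large-valued column does not dominate the distance.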
This topic assumes you've selected New Model... in the Logi Predict application and have been shown this panel:
Select the Clustering Model and click Create, as shown above.
When the new model page is displayed, you can give the model a name by clicking the gear icon, as shown above. Note that you can also duplicate a model, in order to use an existing model as a template for a new one. Use the following steps to create and train the model:
Step 1. Select Historical Data - Select the historical data to be analyzed for predictive patterns and indicators. This should not be the new or current data from which you want to generate predictions. If you have multiple data sources to choose from, additional selection controls will be displayed. Select the columns to be used, apply filters or formula columns as desired, and click OK. The selected data then appears in a 10-row table.
You can expand and collapse the data selection controls using the icons.
Step 2. Select Training Options - Use these controls to configure the training parameters. The options, which will vary based on the algorithm selected, are:
- Algorithm - Select a statistical algorithm to use when training the model. Algorithms are discussed in detail in About Logi Predict. The available algorithms include:
- K-Means - A widely used method of grouping data into a specified number of clusters, based on mean data values. This algorithm uses "incremental clustering", wherein analysis of each data row builds on that of previous rows to determine appropriate clusters.
- Number of Clusters - Specify the number of clusters into which you want to place the data.
- Smart Data Cleanup - This feature helps handle "dirty" data (data with nulls and other imperfections), improving accuracy. See the discussion of Smart Data Cleanup later in this topic for more information about this process.
- Data Volume - This selection allows you to make "test" training runs with subsets of your data and to tweak your model for the best accuracy before committing to running the training with all of the historical data. In some cases, accuracy may fluctuate when more rows are used, so training with different data volumes is useful.
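The Data Volume option can be thought of as training on a random subset of the historical rows. The helper below is a minimal sketch of that idea only; the name `sample_rows`, the percentage parameter, and the use of a simple random sample are assumptions, since the page does not describe how Logi Predict picks the subset.

```python
import random


def sample_rows(rows, percent, seed=0):
    """Return a random subset containing roughly `percent` % of the rows."""
    count = max(1, round(len(rows) * percent / 100))  # always keep at least one row
    return random.Random(seed).sample(rows, count)


# Hypothetical data: 200 row IDs standing in for historical records.
all_rows = list(range(200))
quick_test = sample_rows(all_rows, 25)   # a fast trial run on a quarter of the data
full_run = sample_rows(all_rows, 100)    # the final run on everything
```

Comparing the results of a few subset sizes before the full run is one way to see whether the cluster quality is stable as more rows are added.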
Click Train Model to start the training process.
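The "incremental clustering" mentioned for K-Means can be pictured as a single pass in which each row nudges the nearest cluster center. The sketch below shows one common form of this, sequential (MacQueen-style) K-Means. Whether Logi Predict uses exactly this update rule is an assumption, and the function name and sample rows are hypothetical.

```python
import math


def incremental_kmeans(rows, k):
    """Sequential K-Means: one pass over the data, one row at a time."""
    # Seed the centroids with the first k rows; each starts with a count of 1.
    centroids = [list(row) for row in rows[:k]]
    counts = [1] * k
    for row in rows[k:]:
        # Find the centroid nearest to this row.
        i = min(range(k), key=lambda j: math.dist(row, centroids[j]))
        counts[i] += 1
        # Nudge that centroid toward the row; it stays the running mean of its members.
        centroids[i] = [c + (x - c) / counts[i] for c, x in zip(centroids[i], row)]
    return centroids, counts


# Hypothetical rows, e.g. (loan amount in thousands, term in years).
rows = [(10, 1), (200, 30), (12, 2), (210, 25), (8, 1), (190, 20)]
centroids, counts = incremental_kmeans(rows, k=2)
```

Because each update depends on the centroids left by earlier rows, the result can vary with row order, which is one reason trial runs with different data volumes can give different quality figures.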
Step 3. Review and Select Training Results - When model training begins, an entry is added to a results table at the bottom of the page. You can abort the training run, if necessary, using the link shown above.
When training finishes, the table entry will include data similar to that shown above. Column contents include:
- Status - The current status of the training run. As shown earlier, during the run an Abort Training... button is displayed here. Any errors that may occur during training will be described in a message here as well.
- Selected Training - There may be several entries in this table, describing several training runs for this model. A green checkmark here indicates that this training has been selected for use when this model is used in a prediction plan. Unselected entries will only have a Go Live link in this column. Click New Plan to create a new prediction plan using this training/model.
Click the icon to open the API wizard. For more information, see Logi Predict API.
- Algorithm - The algorithm-related training options used for this training run.
- Training Columns - The number of columns selected for use during training. The number is also a link to information that provides insight into how the data was used:
In the example above, you can see which columns were used for the training. Any columns affected by Smart Data Cleanup will also be listed here, with an explanation. Click X to close the panel.
- Training Rows - The number of rows of data used in this training.
- Cluster Details - A visualization of the distribution of the values into clusters, and the Details and Naming link.
Hover your mouse cursor over the chart, as shown above, to see the actual distribution count.
Click the Details and Naming link to view summary information about the columns and clusters used in the training. In the example shown above, optional names have been assigned to the clusters; entries here are automatically saved. The names will be used to identify the data in the results when the model is used in a prediction plan. Click X to close the panel.
- Cluster Quality % - The accuracy of the clustering in this model after training with these options. If quality is low, try adjusting the training options and retraining the model.
- Time - The timestamp and duration of the most recent training.
- Actions - The Refresh icon resets the data selection and training option controls on this page to the values used for this training run, so that you can review the full details, make adjustments, and rerun the training. Click the Trashcan icon to delete the entry from the table. This icon only appears in entries that are not selected for use with a prediction plan; it is shown here for illustration purposes.
You can click many of the column headers to sort the table on that column.
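This page does not say how Cluster Quality % is calculated. One common measure for K-Means results is the share of total variance explained by the clusters (between-cluster sum of squares divided by total sum of squares). The sketch below computes that measure purely as an illustration; treating it as the Logi Predict formula is an assumption, and the function name and sample rows are hypothetical.

```python
def cluster_quality_percent(rows, labels):
    """Percentage of total variance explained by the clusters (between_SS / total_SS)."""

    def mean(points):
        return [sum(col) / len(points) for col in zip(*points)]

    def sum_sq(points, center):
        return sum((x - c) ** 2 for p in points for x, c in zip(p, center))

    total_ss = sum_sq(rows, mean(rows))
    within_ss = 0.0
    for label in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == label]
        within_ss += sum_sq(members, mean(members))
    # Guard against a data set where every row is identical.
    return 100.0 * (total_ss - within_ss) / total_ss if total_ss else 0.0


# Four hypothetical rows forming two tight pairs, left and right.
rows = [(1, 1), (1, 3), (9, 1), (9, 3)]
good = cluster_quality_percent(rows, [0, 0, 1, 1])  # clusters split left/right
poor = cluster_quality_percent(rows, [0, 1, 0, 1])  # clusters split top/bottom
```

With this measure, a grouping that matches the natural structure of the data scores close to 100, while one that cuts across it scores close to 0.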
Once you've created and trained an accurate model, you're ready to use it in a prediction plan, as discussed in Logi Predict Setup and Use.
Production data, as we all know, is rarely in pristine condition. This can affect the accuracy of model training, so Logi Predict includes a Smart Data Cleanup feature that attempts to compensate for various data situations.
Smart Data Cleanup does not update or rewrite the actual data in your data source - it applies its compensations to values in memory after reading the data, before they're used for training.
Smart Data Cleanup works by applying a set of data management rules, which are shown below. Note that the thresholds and quantities shown are defaults that may be changed by your system admin or application developer.
- Ignore highly-correlated columns (columns with values that match 80% of another column's values)
- Ignore columns whose values are all the same (zero variance)
- Ignore duplicated rows
- Ignore a Numeric column with missing data values ("NA") in 40% or more of its rows
- Replace NAs with the column's mean value in Numeric columns that have NAs in fewer than 40% of their rows
- Replace empty strings ("") in Categorical columns with NAs
- Replace Categorical column NAs with the "UnknownNAs" string value as a new category
- Replace spaces or invalid characters in Categorical column values with dots or underscores (as the R make.names function does)
- Ignore Categorical columns with more than six levels
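To make the Numeric column, zero-variance, and duplicate-row rules above concrete, here is a minimal sketch that applies them to in-memory data, leaving the source untouched. The function names, the dict-of-lists layout, and the use of `None` for NA are assumptions for the example, not Logi Predict's internals. The 40% threshold is the documented default.

```python
NA_THRESHOLD = 0.40  # default from the rule list; configurable in the product


def clean_numeric_columns(columns):
    """Apply the numeric rules to a dict of {column name: list of numbers or None}."""
    cleaned = {}
    for name, values in columns.items():
        if not values:
            continue  # nothing to train on
        present = [v for v in values if v is not None]
        missing = len(values) - len(present)
        if missing / len(values) >= NA_THRESHOLD:
            continue  # ignore: NAs in 40% or more of the rows
        if len(set(present)) <= 1:
            continue  # ignore: every value is the same (zero variance)
        mean = sum(present) / len(present)
        cleaned[name] = [mean if v is None else v for v in values]
    return cleaned


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row (a tuple), preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical columns illustrating each outcome.
columns = {
    "age": [30, None, 50, 40, 40],          # 20% NA: the NA is replaced by the mean
    "income": [None, None, 55, None, 60],   # 60% NA: column ignored
    "region_code": [7, 7, 7, 7, 7],         # zero variance: column ignored
}
cleaned = clean_numeric_columns(columns)
```

Only the `age` column survives here, with its missing value filled in by the mean of the four known ages.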
Unless your data has been specifically groomed in advance for analysis, we recommend that you leave this feature enabled in the training options.
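The Categorical column rules listed above can be sketched in the same spirit. The function below is illustrative only: its name, the order in which the rules are applied, and the decision to count "UnknownNAs" as a level are assumptions, and the character replacement is a rough stand-in for R's make.names rather than a full reimplementation.

```python
import re

MAX_LEVELS = 6  # default from the rule list; configurable in the product


def clean_categorical_column(values):
    """Return the cleaned values, or None if the column should be ignored."""
    # Empty strings are treated as NA, and every NA becomes its own category.
    filled = ["UnknownNAs" if v is None or v == "" else v for v in values]
    # Replace spaces and other characters outside letters, digits, dot, and
    # underscore with dots.
    tidy = [re.sub(r"[^A-Za-z0-9._]", ".", v) for v in filled]
    if len(set(tidy)) > MAX_LEVELS:
        return None  # ignore: more than six distinct levels
    return tidy


# Hypothetical column values.
states = clean_categorical_column(["New York", "", None, "N/A"])
too_many = clean_categorical_column(list("abcdefg"))  # seven distinct levels
```

Columns with many distinct levels, such as free-text names or IDs, are dropped by the last rule, which is why `too_many` comes back as `None`.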