When you attempt to train a model on a messy dataset, you may sometimes get a "Failed data check" message back from the analytical engine. This means the engine could not start training because the dataset you submitted did not pass its quality checks. Below are several things to check in your dataset during variable selection before resubmitting it for training:
- Is your dataset empty once all records with missing values are removed? If your dataset may contain many nulls or empty values, change your missing values strategy to something other than exclude. For instance, for a categorical variable you may want to select "Use Unknown" as the in-fill strategy.
- Are you including categorical variables with many categories? If a categorical variable has many distinct values, e.g. thousands, consider replacing it with a streamlined version that has fewer values. A training request whose dataset includes more than 1,000 different categorical values will fail data checks.
- Does your record index include duplicates? If it does, the records you get back after training or predicting may not be tagged with the correct ID. Make sure your record index contains no duplicates, as you would for any primary key.
- Are all variable types accurately captured by Analyzr? Check that each categorical variable you selected is classified as categorical and each numerical variable as numerical.
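If you would like to run these checks on your own machine before resubmitting, the checklist above can be sketched as a short Python script. This is only an illustration and not how the Analyzr engine performs its data checks: the function name `precheck`, the list of missing-value markers, and the choice to apply the 1,000 limit to each categorical variable separately are all assumptions made for this example.

```python
from collections import Counter

# Assumed markers for missing values; adjust to match your own data.
MISSING = {"", "NA", "NaN", "null"}
# Limit quoted in the checklist; applying it per variable is an assumption.
MAX_CATEGORIES = 1000


def is_missing(value):
    """True if the value is None or one of the assumed missing markers."""
    return value is None or str(value).strip() in MISSING


def is_number(value):
    """True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def precheck(rows, index_col, categorical, numerical):
    """Return a list of problems found in `rows` (a list of dicts)."""
    problems = []
    columns = [index_col] + categorical + numerical

    # 1. Would the dataset be empty if incomplete records were excluded?
    complete = [r for r in rows
                if not any(is_missing(r.get(c)) for c in columns)]
    if not complete:
        problems.append("no complete records: every row has a missing value")

    # 2. Categorical variables with too many distinct values.
    for col in categorical:
        distinct = {r.get(col) for r in rows if not is_missing(r.get(col))}
        if len(distinct) > MAX_CATEGORIES:
            problems.append(f"{col}: {len(distinct)} categories")

    # 3. Duplicates in the record index.
    counts = Counter(r.get(index_col) for r in rows)
    dupes = [key for key, n in counts.items() if n > 1]
    if dupes:
        problems.append(f"{index_col}: {len(dupes)} duplicated ID(s)")

    # 4. Numerical variables that contain non-numeric values,
    #    a common sign that the variable type is misclassified.
    for col in numerical:
        bad = [r.get(col) for r in rows
               if not is_missing(r.get(col)) and not is_number(r.get(col))]
        if bad:
            problems.append(f"{col}: {len(bad)} non-numeric value(s)")

    return problems


# Hypothetical sample data, e.g. as read with csv.DictReader.
rows = [
    {"id": "1", "plan": "basic", "spend": "10.5"},
    {"id": "2", "plan": "pro", "spend": "n/a"},
    {"id": "2", "plan": "", "spend": "7"},
]
for problem in precheck(rows, "id", ["plan"], ["spend"]):
    print(problem)
```

An empty result means none of these four issues were detected locally; it does not guarantee that the dataset will pass the engine's own checks.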
Still having issues with training your model? Feel free to reach out to the Analyzr Support team and submit a ticket.