AI Literacy: Comparing AI models
Get familiar with the key terms to help you evaluate and compare AI models
Welcome to the AI Literacy Mini-Series
Whilst Artificial Intelligence (AI) and Machine Learning (ML) have been actively researched for many years, the recent breakthroughs in their capabilities and adoption are poised to have profound, lasting impacts on business and the world.
As part of the ‘AI Literacy’ mini-series, we will delve into key concepts and terminology you need to know to better inform your decision making, enabling you to:
Interpret and understand the latest news and developments
Gain a deeper understanding of the core mechanics that drive AI technology
Comparing AI Models
A combination of quantitative metrics, evaluation techniques and practical considerations needs to be weighed to build a holistic view of performance
Key Topics Covered
Performance evaluation metrics
Evaluation techniques
Practical considerations
Transparency and openness
LLM model-specific parameters
Performance Evaluation Metrics provide quantitative measures of how well the AI model performs
Accuracy: measures the correctness of the model's predictions. It is commonly used for classification tasks and is expressed as a percentage (%).
Precision and Recall: Precision measures the proportion (%) of correctly predicted positive instances out of all predicted positive instances. Recall measures the proportion (%) of correctly predicted positive instances out of all actual positive instances.
F1 Score: Is the harmonic mean of precision and recall. It is useful when both precision and recall need to be considered together.
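The four metrics above can be computed directly from the raw prediction counts. A minimal sketch in plain Python (the labels below are made up for illustration; 1 marks the positive class):

```python
# Illustration: accuracy, precision, recall and F1 computed from scratch
# for a binary classification task (1 = positive, 0 = negative).

def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical example: six predictions against the true labels.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
m = classification_metrics(y_true, y_pred)
```

Here the model gets 4 of 6 predictions right (accuracy ≈ 0.67), with precision and recall both 2/3.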
Mean Absolute Error (MAE): Regression focused metric which measures the average absolute difference between the predicted values and the true values. Lower values indicate better performance.
Mean Squared Error (MSE): Regression focused metric which measures the average of the squared differences between the predicted values and the true values. Lower values indicate better performance.
R-squared (R²) Score: Regression focused metric which measures the proportion of the variance in the dependent variable that is predictable. It typically ranges from 0 to 1 (and can be negative when the model fits worse than simply predicting the mean), with a higher value indicating a better fit of the model to the data.
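The three regression metrics above fall out of the same prediction errors. A minimal sketch, using made-up values:

```python
# Illustration: MAE, MSE and R² computed from scratch for a small
# regression example (values are invented for demonstration).

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance
    ss_res = sum(e * e for e in errors)                 # unexplained variance
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
mae, mse, r2 = regression_metrics(y_true, y_pred)
```

With these numbers MAE is 0.25, MSE is 0.125 and R² is 0.975: most of the variance in the targets is explained.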
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): Is an evaluation metric used for binary classification models to measure the model's ability to rank instances correctly and is useful for imbalanced class distributions or when the cost of false positives and false negatives is unequal.
The ROC curve is created by plotting the following at various classification thresholds:
True Positive Rate (TPR), also called sensitivity or recall: the proportion (%) of correctly predicted positive instances out of all actual positive instances.
False Positive Rate (FPR): the proportion (%) of actual negative instances that are incorrectly predicted as positive.
AUC is the area under the ROC curve. It ranges from 0 to 1, with a higher value indicating better performance. An AUC of 0.5 suggests that the model performs on par with a random guess.
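The curve and its area can be traced by sweeping each predicted score as a threshold. A minimal sketch, with invented probabilities for the positive class, using the trapezoidal rule for the area:

```python
# Illustration: tracing an ROC curve and computing AUC with the
# trapezoidal rule, using made-up predicted probabilities.

def roc_auc(y_true, scores):
    thresholds = sorted(set(scores), reverse=True)  # sweep highest first
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, starting at the origin
    for thr in thresholds:
        preds = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    # Trapezoidal area under the (FPR, TPR) curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)
```

For this toy ranking the AUC works out to about 0.89; a model that scored positives and negatives at random would hover around 0.5.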
Mean Average Precision (MAP): Calculates the precision at each relevant document's rank, averages these per query, and then averages across queries to produce a single score. It is typically used to evaluate information retrieval systems, such as recommendation or search engines.
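The two averaging steps can be made concrete with a short sketch. The ranked relevance flags below are made up (1 = relevant document at that rank):

```python
# Illustration: Mean Average Precision over ranked result lists.
# Each list holds relevance flags (1 = relevant) in ranked order.

def average_precision(relevances):
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at each relevant rank
    return score / hits if hits else 0.0

def mean_average_precision(queries):
    # Average the per-query AP scores into a single number.
    return sum(average_precision(q) for q in queries) / len(queries)

# Two hypothetical queries with their ranked retrieval results.
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
map_score = mean_average_precision(queries)
```

The first query scores AP = 5/6, the second 7/12, giving a MAP of roughly 0.71.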
Evaluation Techniques provide additional ways of analysing AI model performance
Confusion Matrix: A table that summarises the model's performance by showing the counts of true positive, true negative, false positive, and false negative predictions. It provides a more detailed understanding of the model's performance and can be used to calculate various evaluation metrics.
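Tallying the four cells of a binary confusion matrix is a one-pass count. A minimal sketch with invented labels:

```python
# Illustration: tallying a 2x2 confusion matrix for binary labels.

def confusion_matrix(y_true, y_pred):
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1   # true positive
        elif t == 0 and p == 0:
            counts["tn"] += 1   # true negative
        elif t == 0 and p == 1:
            counts["fp"] += 1   # false positive
        else:
            counts["fn"] += 1   # false negative
    return counts

cm = confusion_matrix([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```

From these four counts every metric above (accuracy, precision, recall, F1, TPR, FPR) can be derived.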
Cross-Validation: Used to assess the model's performance on multiple subsets of the data. It helps evaluate the model's generalisation ability and reduces the risk of overfitting or underfitting.
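A k-fold split is the most common form: the data is divided into k folds, each fold serves once as the held-out test set, and the k scores are averaged. A minimal sketch (the scoring callback here is a hypothetical stand-in for actually training a model):

```python
# Illustration: k-fold cross-validation indices and score averaging.

def k_fold_indices(n_samples, k):
    indices = list(range(n_samples))
    # Spread any remainder across the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    # train_and_score(train_idx, test_idx) is a placeholder for fitting a
    # model on the training indices and scoring it on the test indices.
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k

folds = k_fold_indices(10, 3)
```

Every sample appears in exactly one test fold, so the averaged score reflects performance on data the model never trained on.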
HumanEval: A benchmark of hand-written programming problems used to evaluate an AI model's code-generation ability. In comparison tables it is commonly expressed as a pass rate (%), i.e. the share of problems for which the generated code passes the tests.
Practical Considerations:
Computational Performance: Apart from predictive performance, it's important to consider the computational requirements of the model, such as training time, memory usage, and inference speed. These factors impact the model's practicality and scalability.
Speed (tokens per second): Measures how quickly an LLM generates tokens, indicating its inference efficiency.
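The calculation itself is simple division. A tiny sketch, with made-up numbers:

```python
# Illustration: throughput is tokens generated divided by wall-clock time.
# The figures below are invented for demonstration.

def tokens_per_second(tokens_generated, elapsed_seconds):
    return tokens_generated / elapsed_seconds

# e.g. a model that emits 1,200 tokens in 15 seconds runs at 80 tokens/s.
rate = tokens_per_second(1200, 15)
```

When comparing published figures, check whether they measure generation (output) tokens only or include prompt processing, as vendors report both.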
Transparency and Openness:
Open Source: The source code is publicly available to review, modify, and share. This transparency helps the community spot malicious code and built-in biases.
Closed Source: The source code is kept private and is not publicly available.
LLM Model-Specific Parameters:
Number of Tokens: In comparison tables this typically refers to the amount of text data (measured in tokens) that the LLM was trained on. Training on more tokens generally improves capability, but is also more computationally intensive.
Token Context Window: The maximum number of tokens analysed when predicting the next token. The larger the number, the more context can be included and considered in a prompt.
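Anything beyond the window cannot influence the model's output, so long inputs must be trimmed. A hypothetical sketch of the idea (real systems use model-specific tokenisers; whitespace splitting here is a stand-in, and the window size is made up):

```python
# Sketch: keeping a prompt within a model's context window by dropping
# the oldest tokens. Whitespace splitting stands in for real tokenisation.

CONTEXT_WINDOW = 8  # hypothetical limit, in tokens

def truncate_to_window(prompt, window=CONTEXT_WINDOW):
    tokens = prompt.split()             # placeholder tokenisation
    return " ".join(tokens[-window:])   # keep only the most recent tokens

text = "one two three four five six seven eight nine ten"
trimmed = truncate_to_window(text)
```

This is why a larger context window matters: with a 10-token prompt and an 8-token window, the two oldest tokens are simply lost to the model.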
Measuring Massive Multitask Language Understanding (MMLU): A benchmark used to evaluate an LLM's language understanding across 57 subjects spanning STEM, the humanities and social sciences, with performance reported as accuracy (%) across the tasks.
Thank you for reading! Stay tuned for the continuation of this series. Hopefully you will now be able to interpret the results of any AI model comparison table!
Follow @ExecSumFIN on Twitter or Substack Notes for daily news updates and notifications.
If you like what you read, refer a friend and get rewarded, or leave a like or comment.