Classification Report in ML: A Complete Guide

A classification report reveals ML model performance by class: precision, recall, F1-score, and support. Learn to read, interpret, and act on your results.

Andrew Martin

Don't Trust Accuracy on Imbalanced Data

A model predicting the majority class 100% of the time achieves high accuracy but zero recall for the minority class. Always check the full classification report before declaring a model production-ready.

A classification report gives you a complete diagnostic picture of your ML model’s performance broken down by class. Where accuracy gives you one number, a classification report gives you four metrics per class — precision, recall, F1-score, and support — revealing where your model excels and where it fails, particularly on minority classes.

The scikit-learn classification_report() function generates this output automatically from your model’s predictions. Understanding how to read it, and more importantly how to act on it, is one of the most practical skills in applied machine learning. This guide covers the full output, including the averaging methods that most practitioners overlook.

What Is a Classification Report?

A classification report is a structured performance summary of a classification model, displaying precision, recall, F1-score, and support for each output class. It is generated from the confusion matrix and provides class-level diagnostic information that aggregate metrics like accuracy cannot surface.

Scikit-learn’s classification_report() function (Pedregosa et al., JMLR 2011) is the standard implementation in Python ML workflows. It accepts the actual labels and predicted labels from your test set and returns a formatted report alongside an optional dictionary for programmatic use.
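As a minimal sketch of that call, the snippet below builds a report from ten toy labels. The labels and the class names "negative" and "positive" are invented for illustration; only classification_report() and its target_names and output_dict parameters come from scikit-learn.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Formatted text report: one row per class plus the summary rows.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# The same numbers as a nested dict, convenient for logging or automated checks.
report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"], output_dict=True
)
positive_recall = report["positive"]["recall"]
print(f"positive recall: {positive_recall:.2f}")
```

The dictionary form is usually the easier one to wire into a test suite or a monitoring job, since each metric can be read by class name and metric name.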

The Classification Report Structure

A typical binary classification report on a fraud detection dataset looks like this:

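| Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Not Fraud (0) | 0.99 | 0.98 | 0.99 | 9,900 |
| Fraud (1) | 0.62 | 0.73 | 0.67 | 100 |
| Accuracy | | | 0.97 | 10,000 |
| Macro Avg | 0.81 | 0.86 | 0.83 | 10,000 |
| Weighted Avg | 0.98 | 0.97 | 0.97 | 10,000 |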

Notice: 97% accuracy looks strong, but the fraud class — what you actually care about — has an F1-score of only 0.67. This is precisely the insight a classification report surfaces that accuracy hides.

The Confusion Matrix Foundation

Every metric in a classification report derives from the confusion matrix — a grid of actual versus predicted classes. For a binary problem, this gives four counts:

  • True Positives (TP): Positive class instances correctly predicted as positive
  • False Positives (FP): Negative class instances incorrectly predicted as positive
  • False Negatives (FN): Positive class instances incorrectly predicted as negative
  • True Negatives (TN): Negative class instances correctly predicted as negative

For multi-class problems, scikit-learn applies a one-versus-rest approach: each class is treated as positive while all others are treated as negative, yielding a per-class confusion matrix calculation.
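The four counts and the one-versus-rest view can be sketched with scikit-learn's confusion_matrix() and multilabel_confusion_matrix(). The labels below are toy data invented for illustration.

```python
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix

# Binary toy labels, invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
# ravel() flattens [[TN, FP], [FN, TP]] into the four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} precision={precision:.2f} recall={recall:.2f}")

# Multi-class toy labels: one 2x2 matrix per class, each treating that
# class as positive and every other class as negative.
y_true_mc = ["a", "b", "c", "a", "b", "c", "a"]
y_pred_mc = ["a", "b", "a", "a", "c", "c", "b"]
per_class = multilabel_confusion_matrix(y_true_mc, y_pred_mc, labels=["a", "b", "c"])
print(per_class[0])  # matrix for class "a", laid out as [[TN, FP], [FN, TP]]
```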

When to Use a Classification Report

Use a classification report whenever your model predicts discrete classes. This covers binary problems (fraud, churn, spam) and multi-class problems (email routing into five categories, product classification into twenty types, patient risk into low/medium/high tiers). Always generate the report on your held-out test set, not on training data, to avoid overoptimistic results from overfitting. Note that classification_report() takes hard class labels, not probabilities. For models that output class probabilities (logistic regression, gradient boosting, neural networks), the labels returned by predict() correspond to a 0.5 threshold in the binary case, so the report reflects that threshold unless you apply your own cutoff before generating it.

The Four Core Metrics Explained

The four metrics in a classification report — precision, recall, F1-score, and support — each answer a specific question about model behavior. Precision quantifies how often positive predictions are correct, recall measures how many actual positives were caught, F1 balances the two, and support tells you how many instances back each estimate. Understanding what each measures determines which one to optimize for your business context.

Precision: Minimizing False Alarms

Precision measures what fraction of your model’s positive predictions are actually correct. Formula: Precision = TP / (TP + FP). High precision means few false alarms. A model flagging 100 transactions as fraud with 80% precision correctly identifies 80, while 20 are legitimate transactions wrongly blocked.

Precision matters when the cost of false positives is high. Spam filters prioritize precision — incorrectly blocking a legitimate business email costs more than letting spam through. Marketing campaign targeting also favors precision: sending an expensive direct mailer to uninterested prospects wastes budget. Our dedicated guide on what precision is in machine learning walks through worked threshold tuning examples and the cost-matrix framework you should apply before locking in a production threshold.

Low precision in your report indicates your model is over-predicting the positive class. Solutions include raising your classification threshold, applying stronger regularization during training, or engineering features that better separate positive from negative cases.

Recall: Catching What Matters

Recall measures what fraction of all actual positive instances your model correctly detected. Formula: Recall = TP / (TP + FN). High recall means few misses. A medical screening model with 90% recall correctly identifies 90 out of 100 actual positive cases, though it may over-flag some healthy patients.

Recall is the priority when missing a positive case is catastrophic — medical diagnosis, fraud detection, and safety-critical anomaly detection all optimize for recall first. Recall and sensitivity are the same metric; for a full breakdown of the recall formula and its tradeoffs with specificity, see the ML sensitivity and recall guide.

Low recall in your classification report tells you the model is missing too many real positives. Lowering your threshold, resampling the minority class, or applying class weights during training are the primary remedies.

F1-Score: Balancing Precision and Recall

F1-score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It gives a single number balancing both concerns, ranging from 0 (worst) to 1 (best). Use F1 when both false positives and false negatives carry meaningful business costs.

The harmonic mean penalizes extreme imbalances between precision and recall. A model with 100% precision and 1% recall scores an F1 of 0.02 — correctly flagged as near-useless. This makes F1 more informative than a simple arithmetic average (which would give 50.5%).
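The arithmetic is easy to check by hand. The helper f1() below is a hypothetical name written for this illustration, not a library function.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The lopsided model from the text: 100% precision, 1% recall.
lopsided = f1(1.00, 0.01)
arithmetic = (1.00 + 0.01) / 2
print(f"F1 = {lopsided:.4f}, arithmetic mean = {arithmetic:.3f}")

# The fraud row from the example report: precision 0.62, recall 0.73.
fraud_f1 = f1(0.62, 0.73)
print(f"fraud F1 = {fraud_f1:.2f}")
```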

An F1-score above 0.85 is generally strong for binary classification. For multi-class problems, macro F1 above 0.75 is a solid baseline, though always benchmark against a domain-specific threshold rather than a universal rule.

Support: Knowing Your Data Distribution

Support is the count of actual test instances for each class. It is not a performance metric — it is a reliability indicator. Classes with support under 50 samples produce statistically unreliable precision, recall, and F1 estimates regardless of how promising the numbers appear.

Always check the support column before drawing conclusions from your report. If your minority class has only 12 test instances, a recall of 0.75 means the model caught 9 of them, and a single additional miss drops recall to 0.67. That is not a stable estimate for production deployment. When per-class support is this small, use stratified k-fold cross-validation so that every sample is evaluated once across the folds and the per-class estimates rest on more data.
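One way to get more support behind each per-class estimate is to pool out-of-fold predictions from stratified cross-validation and build a single report from them. The sketch below assumes synthetic data from make_classification() and a logistic regression model. Both are stand-ins chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data (roughly 5% positives), invented for illustration.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratification keeps the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by a model that never saw it during training.
y_oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The report now uses every minority sample as a test instance, not just one split's worth.
report = classification_report(y, y_oof, output_dict=True, zero_division=0)
minority_support = report["1"]["support"]
print(f"minority support: {minority_support}, recall: {report['1']['recall']:.2f}")
```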

Support also diagnoses data distribution problems. Extremely low minority class support in your test set often signals the same underrepresentation in your training data, which directly causes poor recall for that class.

Reading Macro, Micro, and Weighted Averages

Beyond per-class metrics, a classification report provides three summary averages. Each averaging method answers a different question and suits different business contexts. Selecting the wrong average as your headline metric leads to misreporting model performance to stakeholders.

Macro Average: Treating All Classes Equally

Macro average computes the arithmetic mean of each metric across all classes, giving equal weight to every class regardless of instance count. Macro F1 = (F1_class1 + F1_class2 + … + F1_classN) / N. Use macro average when all classes matter equally to your business regardless of how frequently they appear.

Macro average exposes under-performance on rare but critical classes. In the fraud example above, macro F1 of 0.83 reveals the model is significantly weaker on the fraud class than the 0.97 weighted average suggests. For regulated industries where missing a specific class (fraud, disease, safety defect) has legal or reputational consequences, macro average is the right lens.

For multi-class models with many classes (20+ category classification), macro F1 can be misleadingly low if a few tail classes are genuinely ambiguous. In that case, report macro F1 separately from a “top-N class” macro F1 for the classes most important to the business.

Weighted Average: Accounting for Class Imbalance

Weighted average computes each metric’s mean weighted by each class’s support. Classes with more instances contribute proportionally more to the final score. Weighted F1 = Σ(F1_class × support_class) / total_support. Use weighted average as your headline production metric when class imbalance reflects real-world occurrence rates.

Weighted average is closest to what your model delivers in production when your deployment data matches your test distribution. The 0.97 weighted F1 in the fraud example reflects performance across the full test set — dominated by legitimate transactions, as it would be in production.

The risk: if you care more about detecting the minority class (and in fraud detection, you almost always do), weighted average masks the minority class weakness behind the majority class’s strong performance. Always pair weighted average with per-class F1 for the classes that drive business value.

Micro Average: Aggregate Pool

Micro average pools all TP, FP, and FN counts across every class first, then computes a single aggregate metric. For standard multiclass classification, micro F1 equals the accuracy score. Use micro average when you want to weight each instance equally rather than each class equally.

Micro average is most useful for multi-label classification, where one instance can belong to multiple classes simultaneously. In that setting, scikit-learn’s classification_report() reports a micro average row instead of accuracy (in both the text output and the output_dict=True form), and it becomes the primary comparison metric.

For standard single-label multiclass classification, weighted average typically provides more actionable insight than micro average, because micro average equals accuracy and shares accuracy’s blind spot on imbalanced data.
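The three averages can be compared side by side with f1_score() and its average parameter. The three-class labels below are toy data invented for illustration, with class "c" deliberately rare and never predicted.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy three-class labels, invented for illustration; class "c" is rare.
y_true = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

# average=None returns one F1 per class, in the order given by labels.
per_class = f1_score(
    y_true, y_pred, average=None, labels=["a", "b", "c"], zero_division=0
)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print("per-class F1:", per_class.round(3))
print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f} accuracy={accuracy:.3f}")
```

In this toy case the missed rare class drags macro F1 well below weighted F1, while micro F1 matches accuracy.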

Business Applications and Metric Selection

Which metric to optimize from your classification report depends on the asymmetric cost of errors in your specific domain. Most businesses face different cost structures for false positives versus false negatives, and the right metric choice follows directly from that cost structure.

High-Recall Use Cases

These applications optimize for recall, accepting higher false positive rates to minimize missed detections.

  • Fraud detection: A missed fraud case costs the full transaction amount plus chargeback and investigation overhead. A false alarm costs a short customer inconvenience and a manual review. Fraud classifiers typically target recall above 0.85, then tune precision to an acceptable false positive rate.
  • Medical screening: A missed cancer diagnosis delays treatment and worsens outcomes. A false positive triggers additional testing but not direct harm. Screening programs commonly target recall above 0.90, using specificity as a secondary constraint to contain unnecessary follow-up procedures.
  • Manufacturing defect detection: Shipping a defective unit causes returns, refunds, and brand damage. Stopping a good unit for reinspection costs minutes of downtime. According to Deloitte’s 2024 Industry 4.0 analysis, AI visual inspection systems targeting 95% recall have reduced customer-facing defect rates by 50-90% across production environments.
  • Safety anomaly detection: In industrial IoT, missing an anomaly risks equipment failure or injury. Targeting recall above 0.95 for critical alerts is standard practice, with alert deduplication and priority routing to manage false alarm fatigue.

High-Precision Use Cases

These applications optimize for precision, accepting missed detections to protect against false alarm costs.

  • Spam filtering: Blocking a legitimate email costs more than letting spam through. Gmail and enterprise spam filters maintain precision above 0.99 by setting high classification thresholds, accepting that some spam reaches inboxes to prevent blocking business-critical mail.
  • Lead scoring for sales teams: Each sales rep follows up on a limited number of leads per week. Surfacing low-quality leads wastes expensive capacity. For B2B lead generation strategy, the precision of your lead scoring model determines how efficiently rep time is allocated.
  • Content moderation: Incorrectly removing a legitimate user post creates a worse experience than missing a borderline policy violation. Precision-first moderation preserves user trust, with a separate human-review queue for borderline cases near the decision boundary.
  • Credit risk flagging: Incorrectly declining a creditworthy applicant has direct revenue impact. Financial institutions balance recall (catching actual defaulters) against precision (not rejecting good customers) based on their specific risk tolerance and regulatory environment.

Setting the Right Threshold for Your Use Case

For binary classifiers that output probabilities, predict() in scikit-learn effectively applies a 0.5 threshold: the model predicts the positive class when its probability output exceeds 50%. Adjusting this threshold directly changes your precision-recall tradeoff and therefore every row in your classification report.

A practical threshold-setting workflow for production ML model training and deployment:

  1. Define the cost asymmetry: Quantify the cost of a false positive versus a false negative. If a missed fraud case costs $800 and a false alarm costs $15 in manual review time, your cost ratio is 53:1 — optimize heavily for recall.
  2. Plot the precision-recall curve: Evaluate precision and recall at every threshold from 0 to 1. Identify the threshold that meets your minimum requirements for both metrics simultaneously.
  3. Validate on a business metric: Translate threshold choices into business outcomes (fraud losses prevented, sales capacity saved, churn avoided) to find the threshold that maximizes business value, not just F1.
  4. Monitor in production: According to McKinsey’s State of AI 2024 report, fewer than 50% of organizations systematically monitor deployed classifier performance — a significant operational gap that leads to undetected model drift.

Pro tip: Run predict_proba() instead of predict() to get probability scores, then manually apply your optimized threshold: predictions = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int). This lets you tune the threshold independently of model training.
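The workflow above can be sketched with precision_recall_curve(). The synthetic data, the logistic regression model, and the 0.85 recall floor (min_recall) are assumptions made for illustration. For brevity the threshold is picked on the test split here. In practice it should be chosen on a separate validation split so the test report stays unbiased.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives), invented for illustration.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# precision and recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the three arrays line up.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
precision, recall = precision[:-1], recall[:-1]

min_recall = 0.85  # hypothetical business requirement
eligible = recall >= min_recall

# Among thresholds that meet the recall floor, take the most precise one.
best = int(np.argmax(np.where(eligible, precision, -1.0)))
threshold = thresholds[best]

predictions = (scores >= threshold).astype(int)
print(f"threshold={threshold:.3f} precision={precision[best]:.2f} recall={recall[best]:.2f}")
```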

When Classification Reports Fall Short

A classification report is a powerful diagnostic tool, but its per-class metrics are evaluated at a single threshold and assume that your test set distribution reflects production. When class imbalance is severe, costs are asymmetric, or each instance can belong to multiple classes, the standard report alone will mislead you. Three specific situations require complementary analysis methods.

Severe Class Imbalance

When your minority class represents under 1% of your dataset, per-class metrics in the classification report can be misleading at the default threshold. A model that learns to always predict the majority class achieves perfect recall and near-perfect precision for that class but zero recall for the minority. The support column hints at this failure mode, but the report does not quantify it across thresholds.

For highly imbalanced problems, complement the classification report with PR-AUC (Precision-Recall Area Under Curve), which provides a threshold-independent view of the precision-recall tradeoff. Per Davis & Goadrich (2006, ICML), PR curves are more informative than ROC curves when positive instances are rare — the specific scenario where classification reports at the default threshold are most likely to mislead.

Techniques to address class imbalance alongside threshold tuning:

  • SMOTE (Synthetic Minority Over-sampling Technique): augments minority class training samples by interpolating between existing examples
  • Class weighting: class_weight='balanced' in scikit-learn adjusts loss function contributions by inverse class frequency during training
  • Threshold tuning: lowering your decision threshold from 0.5 directly increases recall for the minority class at the cost of precision
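As a rough sketch of class weighting alongside threshold-independent metrics, the snippet below trains the same model with and without class_weight='balanced' and reports recall at the default threshold, PR-AUC, and ROC-AUC. The synthetic data and the choice of logistic regression are assumptions for illustration. average_precision_score() is used here as one common estimate of PR-AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 2% positives, invented for illustration.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

results = {}
for name, model in [("plain", plain), ("balanced", balanced)]:
    scores = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "recall@0.5": recall_score(y_test, model.predict(X_test)),  # default threshold
        "pr_auc": average_precision_score(y_test, scores),  # threshold-independent
        "roc_auc": roc_auc_score(y_test, scores),  # threshold-independent
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})
```

Comparing the rows shows how much of any recall change comes from shifting the effective decision boundary rather than from better ranking, which is what the two AUC figures measure.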

Cost-Sensitive Classification

F1-score assumes equal costs for false positives and false negatives. Real business problems rarely have symmetric costs. A fraud detection model where each missed fraud costs $1,000 but each false alarm costs $5 in review time needs a fundamentally different optimization target than F1.

For cost-sensitive problems, build a cost matrix mapping each confusion matrix cell to a dollar value, then find the threshold that minimizes total expected cost — not the one that maximizes F1. Libraries such as imbalanced-learn and cost-sensitive learning wrappers for scikit-learn support this workflow.
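A minimal sketch of that cost-based search, using the $1,000 and $5 figures from the example above. The helper expected_cost(), the toy labels, and the toy scores are all hypothetical and written for this illustration. Correct predictions are assumed to cost nothing.

```python
import numpy as np


def expected_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of the errors made when predicting positive at scores >= threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    fp = np.sum(pred & (y_true == 0))   # false alarms
    fn = np.sum(~pred & (y_true == 1))  # missed positives
    return fp * cost_fp + fn * cost_fn


# Toy labels and model scores, invented for illustration.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

# Evaluate the total cost on a grid of candidate thresholds and keep the cheapest.
candidates = np.linspace(0.0, 1.0, 101)
costs = [expected_cost(y_true, scores, t, cost_fp=5, cost_fn=1000) for t in candidates]
best_threshold = candidates[int(np.argmin(costs))]

cost_at_default = expected_cost(y_true, scores, 0.5, cost_fp=5, cost_fn=1000)
print(f"cost at 0.5: {cost_at_default}, best threshold: {best_threshold:.2f}, cost: {min(costs)}")
```

On this toy data the default 0.5 threshold misses one positive and costs $1,005, while a lower threshold trades two cheap false alarms for zero misses.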

Multi-Label Classification

Standard classification reports assume each instance has exactly one label. When instances can belong to multiple classes simultaneously — a product tagged as both “electronics” and “home appliance”, or a document classified under multiple topics — the report’s interpretation changes.

For multi-label problems, scikit-learn’s classification_report() replaces the accuracy row with a micro-average row and adds a samples-average row, and macro F1 becomes the standard headline metric. Per the Stanford HAI AI Index 2024, multi-label classification is increasingly common in enterprise content management and e-commerce catalog applications as product and content taxonomies grow more complex.
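The change in summary rows is easy to confirm by passing multi-label indicator matrices to classification_report(). The label matrices below are toy data, and the tag names (including "toys") are invented for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report

# Multi-label indicator matrices: one row per instance, one column per label.
# Values are invented for illustration.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

report = classification_report(
    y_true,
    y_pred,
    target_names=["electronics", "home appliance", "toys"],
    output_dict=True,
    zero_division=0,
)

# Summary rows present for multi-label input.
summary_rows = [key for key in report if key.endswith("avg") or key == "accuracy"]
print(summary_rows)
```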

For CNN image classifiers that output multi-class predictions (where each image belongs to exactly one class), standard single-label classification reports apply and macro F1 is the recommended headline metric for evaluating performance across all output categories.

Your Complete Evaluation Pipeline

A classification report works best as part of a broader evaluation workflow. Connecting your classifier performance to downstream outcomes — for example, using Google Analytics 4 to track whether precision improvements in lead scoring actually increase conversion rates — closes the loop between model metrics and business results.

A complete ML evaluation sequence:

  1. Classification report — Per-class precision, recall, F1, support at your chosen threshold
  2. Confusion matrix visualization — Spatial view of which classes get confused with each other, particularly useful for multi-class models with 10+ output categories
  3. ROC-AUC — Threshold-independent model comparison; use for binary classification to compare models or datasets
  4. PR-AUC — Threshold-independent precision-recall tradeoff; preferred over ROC-AUC when positive class instances are under 5% of data
  5. Calibration curve — Verify that your model’s probability outputs are calibrated (does 70% predicted probability actually correspond to 70% observed likelihood?)
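The calibration check in step 5 can be sketched with scikit-learn's calibration_curve(). The synthetic data, the logistic regression model, and the choice of five bins are assumptions for illustration.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic balanced data, invented for illustration.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# prob_pred: mean predicted probability per bin.
# prob_true: observed fraction of positives in the same bin.
prob_true, prob_pred = calibration_curve(y_test, scores, n_bins=5)

for observed, predicted in zip(prob_true, prob_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the two columns track each other closely. Large gaps suggest the probabilities should be recalibrated before they feed a cost-based threshold.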

Classification Report Reference: Metrics at a Glance

| Metric | Formula | When to Prioritize | Scikit-learn average= |
| --- | --- | --- | --- |
| Precision | TP / (TP + FP) | False positives are costly (spam, lead scoring) | 'macro' or per class |
| Recall | TP / (TP + FN) | False negatives are costly (fraud, medical, safety) | 'macro' or per class |
| F1-Score | 2×P×R / (P+R) | Balance both error types symmetrically | 'macro' or 'weighted' |
| Support | Count of actual instances | Reliability check: low support means unreliable metrics | N/A (always shown) |
| Macro Avg | Simple mean across classes | All classes matter equally regardless of frequency | 'macro' |
| Weighted Avg | Support-weighted mean | Production headline metric for imbalanced data | 'weighted' |
| Micro Avg | Pooled TP/FP/FN across classes | Multi-label tasks; equals accuracy for single-label | 'micro' |

Take the Next Step

Reading a classification report is step one. Translating its metrics into threshold decisions, retraining strategies, and business KPIs is what separates a working model from a production system that drives measurable outcomes.

Whether you’re building a fraud detection model that needs to balance recall against false positive volume, or a multi-class classifier for customer segmentation, GrowthGear can help you design the evaluation framework and deployment pipeline to match your business cost structure.

Frequently Asked Questions

A classification report shows precision, recall, F1-score, and support for each class in a classifier. Generated by scikit-learn's classification_report(), it reveals per-class performance that accuracy alone cannot surface, especially on imbalanced datasets.

A deep learning classifier's report derives from the confusion matrix: precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = 2×(P×R)/(P+R), and support = actual instances per class. Deep learning models output probabilities via softmax; argmax gives the predicted class.

An F1-score above 0.85 is generally strong for binary classification. For multi-class problems, macro F1 above 0.75 is good. Always benchmark against a domain baseline — medical screening commonly targets F1 above 0.90.

Macro average treats all classes equally regardless of size. Weighted average weights each class by its support. Use weighted average as your headline metric for imbalanced data; use macro when all classes carry equal business importance.

Use a classification report to diagnose per-class performance at a specific threshold. Use ROC-AUC for threshold-independent model comparison. For severe class imbalance (for example, a positive rate under 5%), prefer PR-AUC over ROC-AUC: Davis & Goadrich (2006, ICML) show that PR curves are more informative than ROC curves when positive instances are rare.

Support is the count of actual test instances for each class. It is a reliability indicator, not a performance metric. Classes with support under 50 samples produce unreliable precision and recall estimates regardless of how good the numbers appear.

A model predicting the majority class 100% of the time gets high accuracy on imbalanced data but zero recall for the minority class. Per Google ML Crash Course, a fraud classifier can achieve 99% accuracy on 1% fraud prevalence while catching zero actual fraud cases.