FIE Security Scanner

Machine Learning Pipeline

Learn how our project classifies Chrome extensions as benign or malicious using engineered static features, feature selection, and supervised learning. Summaries of our evaluation results and visual explanations (feature importance and SHAP) are provided below.

Dataset Construction

To construct our dataset, we compiled a collection of known malicious and benign Chrome extension IDs. Extension IDs were gathered in large batches using Chrome-Stats, a third-party extension-analytics tool.

Benign Criteria

  • Published on the Chrome Web Store by verified publishers
  • Store rating > 4.0 / 5
  • Low risk score according to Chrome-Stats' scoring metric

Malicious Criteria

  • Removed from the Chrome Web Store after being classified as malware
  • Total labeled samples: 5,867
  • Malicious / Benign: 3,001 / 2,866
  • Engineered features per sample: 49
  • Train / Test split: 80% / 20%

Each labeled extension was scanned with our feature-extraction pipeline to produce a structured feature vector. We split the dataset into train and test sets before any model fitting to reduce leakage risk and ensure a fair evaluation.
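The split can be sketched as follows. This is a minimal illustration, not the project's exact code: the variable names are assumptions and synthetic data stands in for the real 49-dimensional feature vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 49))       # 49 engineered features per sample
y = np.array([0] * 50 + [1] * 50)    # 0 = benign, 1 = malicious

# Stratified 80/20 split so the class balance is preserved in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```

Stratifying on the labels keeps the malicious/benign ratio of the test set close to that of the full dataset, which matters for honest precision/recall estimates.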

Feature Engineering

Our pipeline extracts static indicators from extension files and metadata. The goal is to capture behaviors commonly associated with malicious extensions — such as dynamic code execution, suspicious DOM manipulation, and risky external connections — using measurable, engineered features.

Examples of feature groups

  • JavaScript signals (dynamic code gen functions, suspicious objects, event handlers)
  • HTML/DOM signals (DOM operations, sinks density, iframe/form indicators, XSS vectors)
  • Statistical signals (string entropy, whitespace %, avg line length, keyword density)
  • Network/URL signals (external URL counts, http/https domains)
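As one concrete example of the statistical signals, string entropy and whitespace percentage can be computed as below. This is a sketch; the function names are ours and the project's exact feature definitions may differ.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def whitespace_ratio(s: str) -> float:
    """Fraction of characters in a string that are whitespace."""
    return sum(ch.isspace() for ch in s) / len(s) if s else 0.0
```

Minified or packed payloads tend to push entropy toward the alphabet's maximum and squeeze out whitespace, which is what makes these useful static signals for spotting obfuscated extension code.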

Implementation environment

Models were implemented in scikit-learn (Python). We used standard evaluation utilities and saved trained artifacts (e.g., via joblib). We also generated explanation plots (e.g., SHAP) to visualize global feature influence.
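Artifact persistence with joblib follows the usual pattern below; the model, data, and path here are placeholders, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small placeholder model on synthetic data.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Dump the fitted estimator to disk and load it back.
path = os.path.join(tempfile.mkdtemp(), "rf_model.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)
```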

Random Forest: Feature Ranking and Model Selection

We used Random Forest as the main feature-ranking stage in our machine learning pipeline. Rather than using all extracted features directly, we first used Random Forest to estimate which features contributed most strongly to malicious-versus-benign classification, then passed the top-ranked subset into our final SVM-RBF model.

Step 1: Optimize forest size

To choose the number of trees, we trained forests from 15 to 500 estimators and measured out-of-bag error for each one. OOB error is computed from bootstrap samples and gives an internal estimate of how well the forest generalizes without requiring a separate validation split. We selected the forest size that minimized OOB error before running feature selection.

Selected forest size: 275 trees
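Step 1 amounts to an OOB-error sweep like the sketch below; the data is synthetic and the grid of forest sizes is trimmed for brevity (the project swept 15 to 500 estimators).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the extension feature vectors.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

oob_errors = {}
for n_trees in [25, 50, 100, 200]:  # trimmed; project swept 15..500
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        bootstrap=True,
        oob_score=True,   # score each sample with trees that never saw it
        random_state=0,
    )
    rf.fit(X, y)
    oob_errors[n_trees] = 1.0 - rf.oob_score_

# Keep the forest size with the lowest out-of-bag error.
best_n = min(oob_errors, key=oob_errors.get)
```

Because each tree is trained on a bootstrap sample, roughly a third of the data is out-of-bag for any given tree, so `oob_score_` acts as a built-in validation estimate with no extra split.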

Step 2: Select top-k features

With the tree count fixed, we ran 5-fold stratified cross-validation over feature subset sizes. For each fold, we trained a Random Forest on the training split, ranked the features by feature importance, selected the top k features, retrained the model using only those features, and evaluated performance on the validation split.

We repeated this process for every subset size k from 1 up to the total number of features and chose the k with the lowest average Mean Absolute Error (MAE) across folds. With 0/1 labels and hard predictions, MAE reduces to the misclassification rate, so it serves as an average prediction error over benign/malicious labels.

Selection metric: 5-fold CV MAE = 0.128
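Step 2 roughly corresponds to the following sketch on synthetic data; the fold count matches the text, but the feature count and forest sizes here are illustrative, not the project's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_mae = {}
for k in range(1, X.shape[1] + 1):
    fold_maes = []
    for train_idx, val_idx in cv.split(X, y):
        # Rank features on this fold's training split only.
        ranker = RandomForestClassifier(n_estimators=50, random_state=0)
        ranker.fit(X[train_idx], y[train_idx])
        top_k = np.argsort(ranker.feature_importances_)[::-1][:k]

        # Retrain on just the top-k features, score the validation split.
        rf_k = RandomForestClassifier(n_estimators=50, random_state=0)
        rf_k.fit(X[train_idx][:, top_k], y[train_idx])
        preds = rf_k.predict(X[val_idx][:, top_k])
        fold_maes.append(mean_absolute_error(y[val_idx], preds))
    mean_mae[k] = float(np.mean(fold_maes))

# Subset size with the lowest average CV MAE.
best_k = min(mean_mae, key=mean_mae.get)
```

Re-ranking inside each fold (rather than once on all data) keeps the validation splits untouched by the feature-selection step, avoiding selection leakage.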

Step 3: Export ranked feature list

After identifying the optimal subset size, we retrained Random Forest on the training data, produced a final descending ranking of all features, and saved that ordering for later use. This ranked list became the basis for the feature subset used in our final calibrated SVM-RBF classifier.

SVM-RBF: Final Classification Model

We used a Support Vector Machine with an RBF kernel as our final classifier for malicious extension detection. The model was trained on a reduced input space of 39 Random-Forest-ranked features, allowing us to focus on the most informative static and behavioral indicators extracted from extension code.

The training pipeline consisted of StandardScaler followed by SVC(kernel="rbf"). Feature scaling was necessary because SVMs are sensitive to magnitude differences across input variables, and the RBF kernel was chosen to capture non-linear relationships that a linear separator would miss.

Model selection was performed with GridSearchCV over C, gamma, and class_weight using 5-fold stratified cross-validation. We optimized for average precision rather than raw accuracy so the chosen model would better handle class imbalance and prioritize malicious-class retrieval quality.
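A sketch of the pipeline and grid search described above, run on synthetic data; the specific grid values are assumptions, not the project's tuned ranges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling first, then the RBF-kernel SVM.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Illustrative grid; the real search ranges are not shown in this doc.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],
    "svc__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="average_precision",  # optimize AP rather than accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
```

Putting the scaler inside the `Pipeline` means it is refit on each CV training split, so scaling statistics never leak from validation folds into model selection.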

After training, we applied sigmoid probability calibration with CalibratedClassifierCV and evaluated threshold-free metrics such as ROC-AUC and PR-AUC, along with thresholded metrics including precision, recall, F1, balanced accuracy, and MCC. We compared predictions at the default threshold of 0.5, the max-F1 threshold, and a Youden’s J threshold, then saved the calibrated model, selected feature list, best hyperparameters, and final operating threshold into the deployment bundle.

Final feature subset: 39 features
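The calibration and threshold-selection steps can be sketched as follows; the data, split, and base-model settings are illustrative stand-ins for the project's pipeline.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Sigmoid (Platt) calibration wrapped around the scaled RBF SVM.
base = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf = CalibratedClassifierCV(base, method="sigmoid", cv=5)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Max-F1 operating point from the precision-recall curve.
prec, rec, pr_thr = precision_recall_curve(y_te, proba)
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
thr_f1 = float(pr_thr[np.argmax(f1)])

# Youden's J operating point (maximize tpr - fpr) from the ROC curve.
fpr, tpr, roc_thr = roc_curve(y_te, proba)
thr_j = float(roc_thr[np.argmax(tpr - fpr)])
```

Calibrated probabilities make these thresholds meaningful: 0.5, the max-F1 point, and the Youden's J point can then be compared as genuine operating points rather than arbitrary score cutoffs.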

Test Evaluation

Below we report our final test-set performance. These values come from the evaluation artifacts generated in our training pipeline.

Key metrics (RF Test)

Accuracy: 0.853
Precision: 0.744
Recall: 0.853
F1-score: 0.795
ROC-AUC: 0.922
PR-AUC / Avg Precision: 0.885

Key metrics (SVM-RBF Test)

Accuracy: 0.811
Precision: 0.707
Recall: 0.74
F1-score: 0.723
ROC-AUC: 0.856
PR-AUC / Avg Precision: 0.764

Notes

In a malware-detection setting, recall often matters most because false negatives (missed malicious extensions) are costly. However, we also monitor precision to limit false positives; moving the decision threshold shifts this tradeoff.

Visualizations

These plots help explain model behavior and performance.
