FIE

Scanner Pipeline

Learn how our system downloads, extracts, and analyzes Chrome extensions to identify potentially malicious behavior. Each file type is routed to a specialized scanner that extracts structural and behavioral features for classification.

Scanning Overview

After receiving an extension ID, our system downloads the corresponding extension package (either a .zip or .crx file) from sources such as the Chrome Web Store or Chrome Stats. Once downloaded, the contents are extracted so that we can analyze each component of the extension individually.
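The extraction step can be sketched as follows. This is a minimal illustration rather than our actual implementation: it assumes the downloaded package can be opened as a ZIP archive, and the function name `extract_package` and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_package(package_path: str, dest_dir: str) -> list[str]:
    """Extract an extension package and return the archive paths of its files."""
    dest = Path(dest_dir).resolve()
    extracted = []
    with zipfile.ZipFile(package_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            target = (dest / info.filename).resolve()
            # Skip entries whose path would land outside the destination directory.
            if not target.is_relative_to(dest):
                continue
            archive.extract(info, dest)
            extracted.append(info.filename)
    return extracted
```

The returned list of paths is what a later step would sort into per-format categories.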

The extracted files are sorted into several categories based on their type: JavaScript, HTML, CSS, JSON, and a catch-all bin for any remaining file types. Organizing files this way allows each file to be routed to a specialized scanner designed for that specific format.
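One simple way to do this sorting is by file suffix. The sketch below assumes that approach; the mapping `CATEGORY_BY_SUFFIX`, the category names, and the function `categorize_files` are illustrative and not taken from our codebase.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical suffix-to-category mapping; anything else falls into "other".
CATEGORY_BY_SUFFIX = {
    ".js": "javascript",
    ".html": "html",
    ".htm": "html",
    ".css": "css",
    ".json": "json",
}


def categorize_files(paths):
    """Group file paths into scanner categories based on their suffix."""
    buckets = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        buckets[CATEGORY_BY_SUFFIX.get(suffix, "other")].append(path)
    return dict(buckets)
```

Lowercasing the suffix means a file such as `popup.HTML` still reaches the HTML scanner.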

Each scanner extracts features that may indicate malicious behavior. These include structural characteristics (such as formatting patterns or entropy) and behavioral indicators (such as suspicious function calls or risky permissions). All extracted features are aggregated into a centralized Extension class, a container for metadata, file contents, and collected features that is then passed to our machine learning models for classification.
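A container of this kind might look like the sketch below. Only the class name Extension comes from the description above; the field names, the `add_features` and `feature_vector` methods, and the "scanner.feature" naming scheme are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Extension:
    """Holds everything collected about one extension (hypothetical layout)."""
    extension_id: str
    metadata: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)      # path -> file contents
    features: dict = field(default_factory=dict)   # feature name -> value

    def add_features(self, scanner_name, new_features):
        """Merge one scanner's output, prefixing names to avoid collisions."""
        for name, value in new_features.items():
            self.features[f"{scanner_name}.{name}"] = value

    def feature_vector(self, feature_names):
        """Return features in a fixed order, using 0.0 for any that are missing."""
        return [float(self.features.get(name, 0)) for name in feature_names]
```

A fixed feature order matters because most classifiers expect every input vector to have the same layout.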

JavaScript Scanner

JavaScript files are first run through JSBeautify to reverse minification and restore a readable structure. The beautified code is then parsed with Esprima into an Abstract Syntax Tree (AST), which allows us to systematically analyze relationships between functions, variables, and expressions.

Structural Features

Structural features identify suspicious formatting patterns that may indicate packed or obfuscated code — for example, very long lines, low whitespace percentage, or unusually high string entropy.

  • Average line length
  • Frequency of specific characters
  • Average word size
  • String entropy
  • Keyword density
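A few of these measurements can be sketched in plain Python. The example assumes entropy means character-level Shannon entropy computed over the whole source text, which may differ from how our scanner defines string entropy; the function names and feature keys are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def structural_features(source):
    """Compute simple formatting statistics for one source file."""
    lines = source.splitlines() or [""]
    words = source.split()
    whitespace = sum(ch.isspace() for ch in source)
    return {
        "avg_line_length": sum(len(line) for line in lines) / len(lines),
        "avg_word_size": sum(len(w) for w in words) / len(words) if words else 0.0,
        "whitespace_ratio": whitespace / len(source) if source else 0.0,
        "entropy": shannon_entropy(source),
    }
```

Packed or encoded payloads tend to push line length and entropy up and the whitespace ratio down, which is why these values are useful together rather than individually.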

Behavioral Features

Behavioral features capture potentially dangerous functionality. We traverse the AST for CallExpression nodes to locate risky APIs commonly abused in malicious extensions.

  • Code generation functions
  • DOM change methods
  • Event handlers
  • Number of HTTPS scripts
  • Modification callbacks
  • XMLHttpRequests
  • eval calls
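The traversal can be illustrated with a small walker over an ESTree-style AST. This sketch assumes the tree has already been converted to nested dictionaries and lists in the shape Esprima produces; the `RISKY_CALLS` set and both function names are illustrative choices, not our actual rule list.

```python
from collections import Counter

# Illustrative subset of call targets worth counting.
RISKY_CALLS = {"eval", "setTimeout", "setInterval", "document.write"}


def callee_name(callee):
    """Return a dotted name for simple callees, e.g. 'eval' or 'document.write'."""
    kind = callee.get("type")
    if kind == "Identifier":
        return callee.get("name")
    if kind == "MemberExpression" and not callee.get("computed"):
        obj = callee_name(callee.get("object", {}))
        prop = callee.get("property", {}).get("name")
        if obj and prop:
            return f"{obj}.{prop}"
    return None


def count_risky_calls(ast):
    """Walk every node and count CallExpression nodes whose callee is risky."""
    counts = Counter()
    stack = [ast]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if node.get("type") == "CallExpression":
                name = callee_name(node.get("callee", {}))
                if name in RISKY_CALLS:
                    counts[name] += 1
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return counts
```

Note that in the ESTree format a constructor call such as `new XMLHttpRequest()` is a NewExpression node rather than a CallExpression, so counting those would need a second node-type check.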

HTML Scanner

HTML files are read with Python's built-in file handling and analyzed with rule-based detection and pattern matching. The scanner looks for elements commonly used in web-based attacks, such as hidden frames, embedded external content, or inline JavaScript execution, that could inject malicious scripts or silently redirect users.

Suspicious Objects & XSS Vectors

  • num_object_tags
  • num_embed_tags
  • num_applet_tags
  • num_inline_event_handlers
  • num_javascript_urls
  • num_data_urls
  • num_external_script_src
  • num_meta_refresh

Iframes & Forms

  • num_iframe_tags
  • num_external_iframe_src
  • num_form_tags
  • num_external_form_actions
  • num_password_inputs

Feature extraction works by scanning the HTML structure and counting occurrences of specific tags, attributes, and URL patterns.
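As an illustration of this counting approach, the sketch below uses the standard library's `html.parser` module instead of raw pattern matching, so it is a swapped-in technique rather than a copy of our scanner. It covers only a subset of the features listed above, and the class and helper names are hypothetical.

```python
from html.parser import HTMLParser

URL_ATTRS = ("src", "href", "action")


def is_external(url):
    """Treat absolute http(s) and protocol-relative URLs as external."""
    return url.lower().startswith(("http://", "https://", "//"))


class HtmlFeatureCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.features = {
            "num_iframe_tags": 0,
            "num_external_iframe_src": 0,
            "num_external_script_src": 0,
            "num_inline_event_handlers": 0,
            "num_javascript_urls": 0,
            "num_password_inputs": 0,
        }

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases names; attributes without a value arrive as None.
        attr_map = {name: (value or "").strip() for name, value in attrs}
        if tag == "iframe":
            self.features["num_iframe_tags"] += 1
            if is_external(attr_map.get("src", "")):
                self.features["num_external_iframe_src"] += 1
        elif tag == "script" and is_external(attr_map.get("src", "")):
            self.features["num_external_script_src"] += 1
        elif tag == "input" and attr_map.get("type", "").lower() == "password":
            self.features["num_password_inputs"] += 1
        for name, value in attr_map.items():
            # Approximation: any attribute starting with "on" counts as a handler.
            if name.startswith("on"):
                self.features["num_inline_event_handlers"] += 1
            elif name in URL_ATTRS and value.lower().startswith("javascript:"):
                self.features["num_javascript_urls"] += 1


def extract_html_features(html):
    counter = HtmlFeatureCounter()
    counter.feed(html)
    counter.close()
    return counter.features
```

Using a parser rather than regular expressions makes the counts less sensitive to attribute order, quoting style, and letter case.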

Manifest Scanner

Every Chrome extension includes a manifest.json file that defines its configuration and permissions. We load this file using Python's built-in json module, which converts the data into a Python dictionary for straightforward key-based extraction.

Features Extracted

  • Permissions requested by the extension
  • Whether a Content Security Policy (CSP) is defined
  • Domains from which external content is allowed to load
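The key-based extraction can be sketched as below. The function name, the output keys, and the URL regular expression are assumptions for illustration. The sketch handles both manifest layouts: Manifest V2 stores the Content Security Policy as a string, while Manifest V3 stores it as an object of named policies and lists host patterns under `host_permissions`.

```python
import json
import re

# Rough pattern for http(s) sources mentioned inside a CSP string.
URL_PATTERN = re.compile(r"https?://[^\s;'\"]+")


def extract_manifest_features(manifest_text):
    """Pull permission and CSP features out of a manifest.json string."""
    manifest = json.loads(manifest_text)
    permissions = list(manifest.get("permissions", []))
    permissions += manifest.get("host_permissions", [])  # Manifest V3 key
    csp = manifest.get("content_security_policy")
    # MV2: a single string. MV3: a dict such as {"extension_pages": "..."}.
    if isinstance(csp, dict):
        csp_text = " ".join(str(value) for value in csp.values())
    else:
        csp_text = csp or ""
    return {
        "permissions": permissions,
        "num_permissions": len(permissions),
        "has_csp": bool(csp_text),
        "csp_external_sources": URL_PATTERN.findall(csp_text),
    }
```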

Why It Matters

Extensions requesting excessive permissions or allowing content from untrusted domains may pose significant security concerns. The manifest provides high-signal metadata that complements the behavioral features extracted from code files.

CSS Scanner

CSS files are analyzed using regex-based pattern matching, scanning line by line to track the frequency of properties and directives that can sometimes be abused to load external resources or alter page behavior unexpectedly.

Features Extracted

  • background-image properties
  • behavior properties
  • @import rules

Extraction Method

Regular expressions detect the presence of each pattern across all CSS files in the extension. The scanner records how frequently each feature appears, and the results are added to the feature set stored in the Extension class.
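A minimal sketch of this counting, assuming one regular expression per feature applied line by line. The exact patterns, the feature keys, and the function name `extract_css_features` are illustrative and may not match the ones our scanner uses.

```python
import re

CSS_PATTERNS = {
    "num_background_image": re.compile(r"background-image\s*:", re.IGNORECASE),
    # The lookbehind avoids counting properties such as scroll-behavior.
    "num_behavior": re.compile(r"(?<![\w-])behavior\s*:", re.IGNORECASE),
    "num_import_rules": re.compile(r"@import\b", re.IGNORECASE),
}


def extract_css_features(css_files):
    """Count pattern occurrences across the contents of all CSS files."""
    counts = {name: 0 for name in CSS_PATTERNS}
    for css in css_files:
        for line in css.splitlines():
            for name, pattern in CSS_PATTERNS.items():
                counts[name] += len(pattern.findall(line))
    return counts
```

The function takes the file contents as strings, so the totals cover every stylesheet in the extension in a single pass.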
