Security Scanner
Learn how our system downloads, extracts, and analyzes Chrome extensions to identify potentially malicious behavior. Each file type is routed to a specialized scanner that extracts structural and behavioral features for classification.
After receiving an extension ID, our system downloads the corresponding extension package
(either a .zip or .crx file) from sources such as the Chrome Web Store
or Chrome Stats. Once downloaded, the contents are extracted so that we can analyze each
component of the extension individually.
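The extraction step can be sketched as follows. This is a minimal illustration, not our actual downloader: the function names `zip_payload` and `read_members` are hypothetical, and it assumes the package bytes are already in memory. A .crx file is a ZIP archive preceded by a "Cr24" header, so the sketch skips that header before handing the data to Python's `zipfile` module.

```python
import io
import struct
import zipfile


def zip_payload(data: bytes) -> bytes:
    """Return the ZIP portion of a package, skipping a CRX header if present."""
    if data[:4] != b"Cr24":
        return data  # assume a plain .zip file
    version = struct.unpack_from("<I", data, 4)[0]
    if version == 2:
        # CRX2: public key length and signature length follow the version.
        key_len, sig_len = struct.unpack_from("<II", data, 8)
        return data[16 + key_len + sig_len:]
    if version == 3:
        # CRX3: a single header length follows the version.
        header_len = struct.unpack_from("<I", data, 8)[0]
        return data[12 + header_len:]
    raise ValueError(f"unsupported CRX version: {version}")


def read_members(data: bytes) -> dict[str, bytes]:
    """Map each file path in the package to its raw contents."""
    with zipfile.ZipFile(io.BytesIO(zip_payload(data))) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }
```

Keeping the members in a dictionary keyed by path makes it easy to hand each file to the right scanner later.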
The extracted files are sorted into several categories based on their type: JavaScript, HTML, CSS, JSON, and a catch-all bin for any remaining file types. Organizing files this way allows each file to be routed to a specialized scanner designed for that specific format.
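As a rough sketch of this sorting step, files can be bucketed by their extension. The category labels, the `categorize` helper, and the choice to treat `.htm` as HTML are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to scanner category.
CATEGORY_BY_SUFFIX = {
    ".js": "javascript",
    ".html": "html",
    ".htm": "html",
    ".css": "css",
    ".json": "json",
}


def categorize(paths):
    """Group file paths by category; unknown suffixes go to 'other'."""
    buckets = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        buckets[CATEGORY_BY_SUFFIX.get(suffix, "other")].append(path)
    return dict(buckets)
```

For example, `categorize(["popup.html", "icon.png"])` places `popup.html` under `"html"` and `icon.png` under `"other"`.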
Each scanner extracts features that may indicate potentially malicious behavior — including
both structural characteristics (such as formatting patterns or entropy) and behavioral
indicators (such as suspicious function calls or risky permissions). All extracted features
are aggregated into a centralized Extension class, which acts as a container
for metadata, file contents, and collected features, ready to be passed to our machine
learning models for classification.
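One way to picture such a container is a small dataclass. The field names and the `add_features` method below are hypothetical and chosen only for illustration; the real Extension class may be organized differently.

```python
from dataclasses import dataclass, field


@dataclass
class Extension:
    """Illustrative container for one extension's data and features."""

    extension_id: str
    metadata: dict = field(default_factory=dict)   # e.g. name, version
    files: dict = field(default_factory=dict)      # path -> file contents
    features: dict = field(default_factory=dict)   # feature name -> value

    def add_features(self, scanner_name: str, values: dict) -> None:
        """Merge one scanner's output, prefixing names to avoid clashes."""
        for name, value in values.items():
            self.features[f"{scanner_name}.{name}"] = value
```

Prefixing each feature with the scanner name keeps, for example, a CSS count and an HTML count with the same name from overwriting each other.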
JavaScript files are first run through JSBeautify to reverse minification and restore a readable structure. The beautified code is then parsed with Esprima into an Abstract Syntax Tree (AST), which allows us to systematically analyze relationships between functions, variables, and expressions.
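Once a tree is available, it can be walked generically. The sketch below assumes the Esprima output has already been converted to plain nested dicts and lists in the standard ESTree shape; `iter_nodes` and `called_names` are hypothetical helpers. It collects the names of called functions from `CallExpression` nodes.

```python
def iter_nodes(node):
    """Yield every AST node (a dict with a 'type' key) in document order."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


def called_names(ast):
    """Return the function or method name of each call found in the tree."""
    names = []
    for node in iter_nodes(ast):
        if node.get("type") != "CallExpression":
            continue
        callee = node.get("callee") or {}
        if callee.get("type") == "Identifier":
            names.append(callee["name"])  # e.g. eval(...)
        elif callee.get("type") == "MemberExpression" and not callee.get("computed"):
            names.append(callee["property"]["name"])  # e.g. document.write(...)
    return names
```

Comparing the returned names against a watch list of risky APIs then yields simple count features.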
Structural features identify suspicious formatting patterns that may indicate packed or obfuscated code — for example, very long lines, low whitespace percentage, or unusually high string entropy.
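These measurements are straightforward to compute. The sketch below is illustrative: the feature names are hypothetical, and entropy is computed over the whole source text rather than over individual string literals.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def structural_features(source: str) -> dict:
    """Compute simple formatting statistics for one source file."""
    lines = source.splitlines()
    whitespace = sum(ch.isspace() for ch in source)
    return {
        "max_line_length": max((len(line) for line in lines), default=0),
        "whitespace_ratio": whitespace / len(source) if source else 0.0,
        "entropy": shannon_entropy(source),
    }
```

Packed code tends to show a very large `max_line_length` and a small `whitespace_ratio`, while encoded payloads push the entropy upward.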
Behavioral features capture potentially dangerous functionality. We traverse the AST
for CallExpression nodes to locate risky APIs commonly abused in malicious
extensions.
eval calls
HTML files are analyzed using Python's native file reading functionality combined with rule-based detection and pattern matching. The scanner looks for elements commonly used in web-based attacks, such as hidden frames, embedded external content, or inline JavaScript execution, that could inject malicious scripts or silently redirect users.
num_object_tags
num_embed_tags
num_applet_tags
num_inline_event_handlers
num_javascript_urls
num_data_urls
num_external_script_src
num_meta_refresh
num_iframe_tags
num_external_iframe_src
num_form_tags
num_external_form_actions
num_password_inputs
Feature extraction works by scanning the HTML structure and counting occurrences of specific tags, attributes, and URL patterns.
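The counting idea can be illustrated with the standard library's `html.parser`. Note that this swaps in a tag parser where the scanner described above uses rule-based pattern matching. The class name, the `is_external` rule, and the "attribute starts with `on`" heuristic for event handlers are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

COUNTED_TAGS = {"object", "embed", "applet", "iframe", "form"}
EXTERNAL_FEATURES = {
    ("script", "src"): "num_external_script_src",
    ("iframe", "src"): "num_external_iframe_src",
    ("form", "action"): "num_external_form_actions",
}


def is_external(url: str) -> bool:
    """Treat absolute and protocol-relative URLs as external (an assumption)."""
    return url.lower().startswith(("http://", "https://", "//"))


class HtmlFeatureCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in COUNTED_TAGS:
            self.counts[f"num_{tag}_tags"] += 1
        for name, value in attrs:
            value = (value or "").strip()
            lowered = value.lower()
            if name.startswith("on"):  # onclick, onload, ...
                self.counts["num_inline_event_handlers"] += 1
            if lowered.startswith("javascript:"):
                self.counts["num_javascript_urls"] += 1
            if lowered.startswith("data:"):
                self.counts["num_data_urls"] += 1
            if (tag, name) in EXTERNAL_FEATURES and is_external(value):
                self.counts[EXTERNAL_FEATURES[(tag, name)]] += 1
            if (tag, name, lowered) == ("input", "type", "password"):
                self.counts["num_password_inputs"] += 1
            if (tag, name, lowered) == ("meta", "http-equiv", "refresh"):
                self.counts["num_meta_refresh"] += 1


def html_features(markup: str) -> dict:
    """Count the features above; features never seen are simply absent."""
    parser = HtmlFeatureCounter()
    parser.feed(markup)
    parser.close()
    return dict(parser.counts)
```

Callers that need a fixed-length feature vector can read each name with `.get(name, 0)`.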
Every Chrome extension includes a manifest.json file that defines its
configuration and permissions. We load this file using Python's built-in json
module, which converts the data into a Python dictionary for straightforward key-based
extraction.
Extensions requesting excessive permissions or allowing content from untrusted domains may pose significant security concerns. The manifest provides high-signal metadata that complements the behavioral features extracted from code files.
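A minimal sketch of this key-based extraction is shown below. The output feature names and the set of "broad" host patterns are assumptions; the manifest keys `manifest_version`, `permissions`, and `host_permissions` are standard.

```python
import json

# Match patterns that grant access to every site (illustrative, not exhaustive).
BROAD_HOST_PATTERNS = {"<all_urls>", "*://*/*", "http://*/*", "https://*/*"}


def manifest_features(manifest_text: str) -> dict:
    """Extract a few permission-related features from manifest.json text."""
    manifest = json.loads(manifest_text)
    permissions = manifest.get("permissions", [])
    host_permissions = manifest.get("host_permissions", [])
    # Manifest V2 lists host patterns inside "permissions", so check both.
    requested = [p for p in permissions + host_permissions if isinstance(p, str)]
    return {
        "manifest_version": manifest.get("manifest_version"),
        "num_permissions": len(permissions),
        "num_host_permissions": len(host_permissions),
        "requests_all_urls": any(p in BROAD_HOST_PATTERNS for p in requested),
    }
```

Using `.get` with a default keeps the extraction from failing on manifests that omit optional keys.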
CSS files are analyzed using regex-based pattern matching, scanning line by line to track the frequency of properties and directives that can sometimes be abused to load external resources or alter page behavior unexpectedly.
background-image properties
behavior properties
@import rules
Regular expressions detect the presence of each pattern across all CSS files in the
extension. The scanner records how frequently each feature appears, and the results
are added to the feature set stored in the Extension class.
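The line-by-line counting can be sketched as follows. The exact regular expressions and feature names are assumptions. The lookbehind keeps properties such as `scroll-behavior` from being counted as `behavior`.

```python
import re

# Hypothetical feature names mapped to illustrative patterns.
CSS_PATTERNS = {
    "num_background_image": re.compile(r"(?<![\w-])background-image\s*:", re.IGNORECASE),
    "num_behavior": re.compile(r"(?<![\w-])behavior\s*:", re.IGNORECASE),
    "num_import_rules": re.compile(r"@import\b", re.IGNORECASE),
}


def css_features(css_text: str) -> dict:
    """Count how often each pattern occurs, scanning one line at a time."""
    counts = {name: 0 for name in CSS_PATTERNS}
    for line in css_text.splitlines():
        for name, pattern in CSS_PATTERNS.items():
            counts[name] += len(pattern.findall(line))
    return counts
```

Running `css_features` over every CSS file and summing the results gives the per-extension totals.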