Wizard Spider and Sandworm Evaluation: Detection Categories

The evaluation focuses on articulating how detections occur, rather than assigning scores to vendor capabilities.

We organize detections according to each substep (i.e., implementation of a technique). For a detection to be included for a given substep, it must apply to the specific technique-under-test (i.e., the detection must apply to the one technique associated with that substep, not other or all techniques of that Step). For each detection, we require that proof/evidence be provided to us, but we may not include all detection details in public results, particularly when those details are sensitive. While we make every effort to capture all relevant detections, vendor capabilities may be able to detect procedures in ways that we did not capture.

Starting with the Wizard Spider and Sandworm evaluations, each substep has a single detection category that represents the highest level of context provided to the analyst across all detections for that substep. For reference, the context provided by each detection category increases from left to right in the detection category diagram, with Technique providing the highest context.

An image gallery will display evidence of the detection that generated the detection category for that substep, as well as other relevant images from other detections associated with that substep. Data sources will be tied to each screenshot within the gallery. Detections that require detection category modifiers (e.g., Configuration Change or Delayed) will be separated within a substep’s results. This clearly identifies the different types of detections and lets users filter the results to include or exclude modified detections.

To determine the appropriate category for a detection, we review the evidence provided, notes taken during the evaluation, results of follow-up questions to the vendor, and vendor feedback on draft results. We also independently test procedures in a separate lab environment and review open-source tool detections and forensic artifacts. This testing informs what is considered to be a detection for each technique.
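The "highest level of context" rule can be illustrated with a short Python sketch. This is not part of any evaluation tooling: the enum, its numeric values, and the function name `substep_category` are assumptions made for illustration, and the sketch deliberately ignores modifiers and the Not Applicable case.

```python
from enum import IntEnum


class Category(IntEnum):
    """Detection categories ordered by the context they give the analyst.

    The numeric values are illustrative; only their relative order matters.
    """
    NONE = 0
    TELEMETRY = 1
    GENERAL = 2
    TACTIC = 3
    TECHNIQUE = 4


def substep_category(detections):
    """Return the highest-context category among a substep's detections.

    A substep with no detections is treated as NONE (an assumption of
    this sketch).
    """
    return max(detections, default=Category.NONE)


# Hypothetical substep with three recorded detections.
observed = [Category.TELEMETRY, Category.TACTIC, Category.GENERAL]
print(substep_category(observed).name)
```

Using an ordered enum keeps the comparison explicit: the category reported for a substep is simply the maximum over its detections.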

After performing detection categorizations, we calibrate the categories across all vendors to look for discrepancies and ensure categories are applied consistently. The decision of what category to apply is ultimately based on human analysis and is therefore subject to discretion and biases inherent in all human analysis, although we do make efforts to hedge against these biases by structuring analysis as described above.

Data Sources

Screenshots will be tagged with the data source(s) that signify the type of data used to generate the detection. These tags differentiate similar detections and describe them more precisely (e.g., telemetry from file monitoring versus process command-line arguments). The list of possible data source tags will be calibrated by MITRE Engenuity after execution of the evaluations.

Detection Categories

Not Applicable

The vendor did not have visibility on the system under test. For Not Applicable to be in scope for the relevant steps, the vendor must state before the evaluation which systems they did not deploy a sensor on.

No sensor was deployed on the Linux systems within the environment to capture command-line activity, which would have been required to satisfy the detection criteria of the technique under test.

None

No data was made available within the capability related to the behavior under test that satisfies the assigned detection criteria. There are no modifiers, notes, or screenshots included with a None.

Telemetry

Minimally processed data collected by the capability showing that event(s) occurred specific to the behavior under test that satisfies the assigned detection criteria. Evidence must show definitively that the behavior occurred (did happen vs. may have happened) and must be related to the execution mechanism. This data must be visible natively within the tool and can include data retrieved from the endpoint.


Command-line output is produced that shows a certain command was run on a workstation by a given username.

There is a remote shell component within the capability that can be used to pull native OS logs from a system suspected of being compromised for further analysis.

General

Processed data specifying that malicious/abnormal event(s) occurred, with relation to the behavior under test. Few or no details are provided about why the action was performed (tactic) or how it was performed (technique).


A detection describing "cmd.exe /c copy cmd.exe sethc.exe" as abnormal/malicious activity, but not stating it's related to Accessibility Features or a more specific description of what occurred.

A “Suspicious File” detection triggered upon initial execution of the executable file.

A detection stating that "suspicious activity occurred" related to an action but not providing detail regarding the technique under test.

Tactic

Processed data specifying ATT&CK Tactic or equivalent level of enrichment to the data collected by the capability. Gives the analyst information on the potential intent of the activity or helps answer the question "why would this be done". To qualify as a detection, there must be more than a label on the event identifying the ATT&CK Tactic, and it must clearly connect a tactic-level description with the technique under test.


A detection called “Malicious Discovery” is triggered on a series of discovery techniques. The detection does not identify the specific type of discovery performed.

A detection describing that persistence occurred but not specifying how persistence was achieved.

Technique

Processed data specifying ATT&CK Technique, Sub-Technique, or equivalent level of enrichment to the data collected by the capability. Gives the analyst information on how the action was performed or helps answer the question "what was done" (e.g., Accessibility Features or Credential Dumping). To qualify as a detection, there must be more than a label on the event identifying the ATT&CK Technique ID (TID), and it must clearly connect a technique-level description with the technique under test.


A detection called "Credential Dumping" is triggered with enough detail to show what process originated the behavior against lsass.exe and/or provides detail on what type of credential dumping occurred.

A detection for "Lateral Movement with Service Execution" is triggered describing what service launched and what system was targeted.

Modifier Detection Types

Configuration Change

The configuration of the capability was changed since the start of the evaluation. This may be done to show that additional data can be collected and/or processed. The Configuration Change modifier may be applied with additional modifiers describing the nature of the change, to include:

  • Data Sources – Changes made to collect new information by the sensor.
  • Detection Logic – Changes made to data processing logic.
  • UX – Changes related to the display of data that was already collected but not visible to the user.


The sensor is reconfigured to enable the capability to monitor file activity related to data collection. This would be labeled with a modifier for Configuration Change-Data Sources.

A new rule is created, a pre-existing rule enabled, or sensitivities (e.g., blacklists) changed to successfully trigger during a retest. These would be labeled with a modifier Configuration Change-Detection Logic.

Data showing account creation is collected on the backend but not displayed to the end user by default. The vendor changes a backend setting to allow Telemetry on account creation to be displayed in the user interface, so a detection of Telemetry and Configuration Change-UX would be given for the Create Account technique.

Delayed

The detection is not immediately available to the analyst because some factor slows or defers its presentation to the user, for example when subsequent or additional processing is needed to produce a detection for the activity. The Delayed category is not applied for normal automated data ingestion and routine processing that take minimal time for data to appear to the user, nor is it applied due to range or connectivity issues that are unrelated to the capability itself. The Delayed modifier will always be applied with modifiers describing more detail about the nature of the delay.

The capability uses machine learning algorithms that trigger a detection on credential dumping after the normal data ingestion period. This detection would receive the Delayed modifier, with a description of the additional processing time.

We differentiate between types of detection to provide more context around the capabilities a vendor offers, in a way that allows end users to weigh, score, or rank the types of detection against their needs. This approach allows end users of the results to determine what they value most in a detection (e.g., some organizations may want telemetry, while others would want Technique detection).
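As one way an end user might weigh the results against their own needs, the following Python sketch sums a per-category weight for each substep and optionally excludes detections that carry a modifier. Everything in it is hypothetical: the `Detection` record, the weights, the `score` function, and the substep identifiers are illustrative assumptions, not a scoring method used or endorsed by the evaluation.

```python
from dataclasses import dataclass

# Hypothetical weights chosen for illustration; each organization would
# pick its own (for example, weighting Telemetry as highly as Technique).
DEFAULT_WEIGHTS = {"None": 0, "Telemetry": 1, "General": 2, "Tactic": 3, "Technique": 4}


@dataclass(frozen=True)
class Detection:
    substep: str                        # illustrative substep identifier
    category: str                       # e.g. "Telemetry" or "Technique"
    modifiers: frozenset = frozenset()  # e.g. {"Delayed"}


def score(detections, weights=DEFAULT_WEIGHTS, include_modified=True):
    """Sum the best weight seen per substep.

    When include_modified is False, detections carrying any modifier
    (Configuration Change or Delayed) are skipped.
    """
    best = {}
    for d in detections:
        if d.modifiers and not include_modified:
            continue
        weight = weights.get(d.category, 0)
        best[d.substep] = max(best.get(d.substep, 0), weight)
    return sum(best.values())


results = [
    Detection("1.A.1", "Telemetry"),
    Detection("1.A.1", "Technique", frozenset({"Delayed"})),
    Detection("1.A.2", "Tactic", frozenset({"Configuration Change-UX"})),
]
print(score(results))                          # 7
print(score(results, include_modified=False))  # 1
```

The gap between the two totals shows how much of a result depends on modified detections, which is the kind of comparison that separating modifiers within a substep's results is meant to support.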