Appendix A: Methodology
One of the things readers value most about this report is the level of rigor and integrity we employ when collecting, analyzing, and presenting data.
Knowing our readership cares about such things and consumes this information with a keen eye helps keep us honest. Detailing our methods is an important part of that honesty.
First, we make mistakes. A column transposed here; a number not updated there. We’re likely to discover a few things to fix. When we do, we’ll list them on our corrections page: verizon.com/business/resources/reports/dbir/2022/corrections/.
Second, we check our work. Just as the data behind the DBIR figures can be found in our GitHub repository32, we are again publishing our fact-check report there, as we did last year. It’s highly technical, but for those interested, we’ve attempted to test every fact in the report.
Third, science comes in two flavors: creative exploration and causal hypothesis testing. The DBIR is squarely in the former. While we may not be perfect, we believe we provide the best obtainable version of the truth (to a given level of confidence and under the influence of the biases acknowledged below). However, proving causality is best left to randomized controlled trials. The best we can do is correlation. And while correlation is not causation, the two are often related to some extent and often useful.
Non-committal Disclaimer
We would like to reiterate that we make no claim that the findings of this report are representative of all data breaches in all organizations at all times. Even though we believe the combined records from all our contributors more closely reflect reality than any of them in isolation, it is still a sample. And although we believe many of the findings presented in this report to be appropriate for generalization (and our conviction in this grows as we gather more data and compare it to that of others), bias exists.
The DBIR Process
Our overall process remains intact and largely unchanged from previous years33. All incidents included in this report were reviewed and converted (if necessary) into the VERIS framework to create a common, anonymous aggregate data set. If you are unfamiliar with the VERIS framework, it is short for Vocabulary for Event Recording and Incident Sharing; it is free to use, and links to VERIS resources are at the beginning of this report. The collection method and conversion techniques differed between contributors. In general, three basic methods (expounded below) were used to accomplish this:
- Direct recording of paid external forensic investigations and related intelligence operations conducted by Verizon using the VERIS Webapp
- Direct recording by partners using VERIS
- Converting partners’ existing schema into VERIS
All contributors received instruction to omit any information that might identify organizations or individuals involved.
Some source spreadsheets are converted to our standard spreadsheet format through automated mapping to ensure consistent conversion. Reviewed spreadsheets and VERIS Webapp JavaScript Object Notation (JSON) are ingested by an automated workflow that converts the incidents and breaches they contain into the VERIS JSON format as necessary, adds missing enumerations, and then validates the record against business logic and the VERIS schema. The automated workflow subsets the data and analyzes the results. Based on the results of this exploratory analysis, the validation logs from the workflow, and discussions with the partners providing the data, the data is cleaned and re-analyzed. This process runs nightly for roughly two months as data is collected and analyzed.
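The normalize-then-validate step described above can be sketched as follows. This is a simplified illustration, not the actual DBIR pipeline: the field names and the "fill missing enumerations with unknown" rule are assumptions standing in for the real VERIS schema and business logic.

```python
# Hypothetical sketch of the ingestion step: fill missing enumerations,
# then validate against simple business logic. Field names are
# illustrative, not the real VERIS schema.

REQUIRED_FIELDS = {"action", "actor", "asset", "attribute"}

def normalize(record: dict) -> dict:
    """Fill missing enumerations with 'unknown' so downstream analysis
    can tell 'unmeasured' apart from a measured value."""
    out = dict(record)
    for field in REQUIRED_FIELDS:
        out.setdefault(field, "unknown")
    return out

def validate(record: dict) -> list:
    """Return a list of violations (empty if the record passes)."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    return errors

raw = {"action": "hacking", "actor": "external"}
clean = normalize(raw)
print(validate(clean))  # → []
```

A normalized record always passes this particular check, which is the point: missing data is made explicit as "unknown" rather than silently dropped.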
Incident Data
Our data is non-exclusively multinomial, meaning a single feature, such as “Action,” can have multiple values (e.g., “social,” “malware” and “hacking”). This means that percentages do not necessarily add up to 100%. For example, if there are 5 botnet breaches, the sample size is 5. However, since each botnet breach used phishing, installed keyloggers, and used stolen credentials, there would be 5 social actions, 5 hacking actions, and 5 malware actions, adding up to 300%. This is normal, expected, and handled correctly in our analysis and tooling.
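The botnet example above can be reproduced in a few lines. This is a minimal sketch with made-up records, showing why the denominator stays at the sample size even when each record contributes to several action categories.

```python
# Sketch: percentages in a non-exclusively multinomial feature can sum
# past 100%. The five records mirror the botnet example in the text.
breaches = [{"actions": {"social", "hacking", "malware"}} for _ in range(5)]

n = len(breaches)  # the sample size is 5, not 15
counts = {}
for b in breaches:
    for action in b["actions"]:
        counts[action] = counts.get(action, 0) + 1

percentages = {a: 100 * c / n for a, c in counts.items()}
print(percentages)                 # each action appears in 100% of breaches
print(sum(percentages.values()))   # → 300.0
```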
Another important point is that, when looking at the findings, “unknown” is equivalent to “unmeasured.” That is, if a record (or collection of records) contains elements that have been marked as “unknown” (whether something as basic as the number of records involved in the incident, or as complex as what specific capabilities a piece of malware contained), it means that we cannot make statements about that particular element as it stands in the record; we cannot measure where we have too little information. Because they are “unmeasured,” such elements are not counted in sample sizes. The enumeration “Other” is counted, however, as it means the value was known but was not part of VERIS (or not one of the other bars if found in a bar chart). Finally, “Not Applicable” (normally “NA”) may or may not be counted depending on the claim being analyzed.
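The distinction between “unknown” and “Other” can be made concrete with a toy calculation. The records and the `malware_variety` field below are illustrative, not actual DBIR data.

```python
# Sketch: 'unknown' means 'unmeasured', so records marked unknown for a
# given element are dropped from that element's sample size, while
# 'Other' (a known value, just not a VERIS enumeration) is kept.
records = [
    {"malware_variety": "ransomware"},
    {"malware_variety": "Other"},
    {"malware_variety": "unknown"},
]

measured = [r for r in records if r["malware_variety"].lower() != "unknown"]
n = len(measured)  # sample size is 2, not 3
share = sum(r["malware_variety"] == "ransomware" for r in measured) / n
print(n, share)  # → 2 0.5
```

With the unknown record included, ransomware would appear in one third of records; excluding the unmeasured record gives the defensible figure of one half.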
This year we have again made liberal use of confidence intervals to allow us to analyze smaller sample sizes. We have adopted a few rules to help minimize bias when reading such data. Here we define a ‘small sample’ as fewer than 30 samples:
- Sample sizes smaller than five are too small to analyze
- We won’t talk about counts or percentages for small samples. This goes for figures too, and is why some figures lack the dot for the median frequency
- For small samples, we may talk about the value being in some range, or about values being greater or less than each other. These statements all follow the confidence interval approaches listed above
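The kind of range statement the last rule describes can be sketched with a standard binomial confidence interval. The Wilson score interval used here is one common choice for small samples; the report does not specify its exact method, so treat this as an illustration of the idea rather than the DBIR's actual computation.

```python
# Sketch: a 95% Wilson score interval for a binomial proportion, the
# kind of interval that supports range statements about small samples
# instead of quoting a raw count or percentage.
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return the (low, high) 95% Wilson score interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g., 4 of 10 sampled breaches showing some property
low, high = wilson_interval(4, 10)
print(f"{low:.2f}-{high:.2f}")  # roughly 0.17-0.69
```

Note how wide the interval is at n = 10: the honest statement is "somewhere between roughly 17% and 69%," which is why small samples get range language rather than point estimates.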
Incident Eligibility
For a potential entry to be eligible for the incident/breach corpus, a couple of requirements must be met. The entry must be a confirmed security incident, defined as a loss of Confidentiality, Integrity, or Availability. In addition to meeting the baseline definition of “security incident,” the entry is assessed for quality. We create a subset of incidents (more on subsets later) that pass our quality filter. The details of what makes a “quality” incident are:
- The incident must have at least seven enumerations (e.g., threat actor variety, threat action category, variety of integrity loss, etc.) across 34 fields OR be a DDoS attack. Exceptions are made for confirmed data breaches with fewer than seven enumerations
- The incident must have at least one known VERIS threat action category (hacking, malware, etc.)
In addition to having the level of detail necessary to pass the quality filter, the incident must fall within the timeframe of analysis (November 1, 2020, to October 31, 2021, for this report). The 2021 caseload is the primary analytical focus of the report, but the entire range of data is referenced throughout, notably in trending graphs. We also exclude incidents and breaches affecting individuals that cannot be tied to an organizational attribute loss. If your friend’s laptop was hit with Emotet, it would not be included in this report.
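The eligibility rules above can be combined into one predicate. This is a hypothetical simplification: the field names, the way enumerations are counted, and the breach exception are all stand-ins for the real filter.

```python
# Hypothetical sketch of the quality/eligibility filter: at least seven
# known enumerations OR a DDoS OR a confirmed breach, plus at least one
# known threat action category, within the analysis window.
from datetime import date

WINDOW = (date(2020, 11, 1), date(2021, 10, 31))

def is_eligible(incident: dict) -> bool:
    known = sum(1 for v in incident["enumerations"].values() if v != "unknown")
    enough_detail = (known >= 7
                     or incident.get("is_ddos", False)
                     or incident.get("confirmed_breach", False))
    has_action = bool(incident["actions"])  # hacking, malware, etc.
    in_window = WINDOW[0] <= incident["date"] <= WINDOW[1]
    return enough_detail and has_action and in_window

incident = {
    "enumerations": {f"field_{i}": "known" for i in range(7)},
    "actions": ["hacking"],
    "date": date(2021, 3, 15),
}
print(is_eligible(incident))  # → True
```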
Lastly, for something to be eligible for inclusion into the DBIR, we have to know about it, which brings us to several potential biases we will discuss next.
Acknowledgement and analysis of bias
Many breaches go unreported (though our sample does contain many of those). Many more are as yet unknown by the victim (and thereby unknown to us). Therefore, until we (or someone) can conduct an exhaustive census of every breach that happens in the entire world each year (our study population), we must use sampling. Unfortunately, this process introduces bias.
The first type of bias is random bias introduced by sampling. This year, our maximum margin of error is +/-0.7% for incidents and +/-1.4% for breaches, which is a function of our sample size. Any subset with a smaller sample size will have a wider confidence margin. We’ve expressed this confidence in the complementary cumulative density (slanted) bar charts, hypothetical outcome plot (spaghetti) line charts, quantile dot plots and pictograms.
The second source of bias is sampling bias. We strive for “the best obtainable version of the truth” by collecting breaches from a wide variety of contributors. Still, it is clear that we conduct biased sampling. For instance, some breaches, such as those publicly disclosed, are more likely to enter our corpus, while others, such as classified breaches, are less likely.
The four figures below are an attempt to visualize potential sampling bias. Each radial axis is a VERIS enumeration, and the stacked bar charts represent our data contributors. Ideally, we want the distribution of sources to be roughly equal on the stacked bar charts along all axes. Axes represented by only a single source are more likely to be biased. However, contributions are inherently thick-tailed, with a few contributors providing a lot of data and many contributors providing a few records within a certain area. Still, we see that most axes have multiple large contributors, with smaller contributors adding appreciably to the total incidents along each axis.