7 Computer-Based Methods
After presenting research methods with human participants, we continue with methods that are either executed on computer systems alone (benchmarking, archival data analysis, and simulation) or reason about computer behavior (proofs). Again, this is not an exhaustive list, and the individual techniques can be mixed and matched according to the specific needs.
7.1 Benchmarking
In many subfields of computer science, we need to compare multiple systems or approaches according to some criterion. This is commonly accomplished with a benchmark study.
A benchmark is a standardized test (a program or a dataset), combined with a prescribed usage procedure, whose main purpose is to evaluate and compare systems with respect to their performance, reliability, or other characteristics. Based on a set of inputs, operations, and conditions called the workload, it produces metrics, such as execution time or F1 score.
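To make the notion of a benchmark metric concrete, the following is a minimal sketch of computing the F1 score from hypothetical binary predictions against a gold standard; the data and function name are illustrative, not part of any particular benchmark:

```python
def f1_score(gold, predicted):
    """F1: harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical gold labels and system predictions.
gold      = [1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(f1_score(gold, predicted))  # → 0.6666666666666666
```

A full benchmark would prescribe the dataset and the exact evaluation procedure; this sketch only shows how the metric itself is derived.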
The main advantage is an easy and fair comparison of multiple systems, which fosters competition and advances research on the given topic. A disadvantage is that researchers can start optimizing their approaches toward the benchmark, without considering practical applicability.
According to Gray (1992) and Huppler (2009), a good benchmark needs to be:
- Relevant: It should measure the performance of typical real-world operations and be of practical importance.
- Repeatable: We should be able to reproduce the results by running the benchmark again under similar conditions.
- Portable: It should be easy to port the benchmark to other systems or architectures.
- Scalable: A benchmark should be executable with a smaller/lighter workload to be economical enough, but also with a larger one for realism and precision.
Kounev et al. (2025) divide benchmarks into specification-based, kit-based, and hybrid. Specification-based benchmarks provide a natural-language description of the workload, while the implementation is up to the user. This allows for greater customizability, but reproducibility is challenging. Kit-based benchmarks provide a complete implementation, and hybrid benchmarks provide only a partial one.
7.1.1 Application
There are two common ways to apply benchmarking in a research study. The first way answers RQs such as “Is X faster/more efficient than Y?” or “Which approach is the best for X?”. Here, we compare a target approach (often the one proposed in the paper) against a baseline or, even better, multiple baselines. A baseline is a relevant competitor of the target approach. Baselines can include a naive algorithm or an approach commonly used in practice. However, the most important baselines are state-of-the-art approaches: the best ones currently published in the research literature, according to the specified criteria. Even if a state-of-the-art approach cannot currently be run on a formal benchmark due to technical limitations, we should still compare the target approach with it to the maximum extent possible.
Second, we can compare multiple variants, parameters, or input sizes for one approach. For instance, an algorithm has two variants, non-cached and cached, where the cache size is configurable in kilobytes. We can plot the non-cached variant and various cache sizes against performance. This could answer questions such as “How does X scale?” or “What are the optimal values of X?”.
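A minimal sketch of such a variant comparison, using a hypothetical recursive computation as the workload: the non-cached variant is compared against cached variants of several cache sizes. Note that Python's `functools.lru_cache` measures cache size in entries, not kilobytes, so it only stands in for the configurable cache from the example above.

```python
import functools
import time

def fib(n):  # non-cached variant
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def make_cached_fib(maxsize):
    """Cached variant; maxsize is the cache capacity in entries."""
    @functools.lru_cache(maxsize=maxsize)
    def fib_c(n):
        return n if n < 2 else fib_c(n - 1) + fib_c(n - 2)
    return fib_c

def measure(fn, n):
    start = time.perf_counter()
    fn(n)
    return time.perf_counter() - start

n = 25
print(f"non-cached:     {measure(fib, n):.5f} s")
for size in (2, 8, 32):
    print(f"cache size {size:3d}: {measure(make_cached_fib(size), n):.5f} s")
```

Plotting the measured times against cache size would then answer the “What are the optimal values of X?” kind of question; here the numbers are merely printed.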
7.1.2 Examples
The details of benchmark application vary greatly across computer science subfields and the specific metrics that the benchmark produces. As an example, we will explain time-based performance measurement of programs. Suppose we would like to measure how various compiler options affect the speed of a running program. After compiling the benchmark with given command-line options, we execute it on an otherwise idle machine. Possible metrics that the benchmark could output include:
- real time (wall clock time): the total execution time, including input/output operations and waiting,
- user and system time: time spent executing processor instructions in the user mode and the operating system’s kernel mode, respectively,
- CPU time: a sum of the user and system time,
- and throughput: the number of operations (e.g., mathematical computations or data items processed) per second.
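As a minimal sketch of the distinction between real and CPU time, Python's standard library exposes both clocks; the workload below (a CPU-bound loop followed by a sleep that models waiting) is purely illustrative:

```python
import time

start_wall = time.perf_counter()   # wall-clock (real) time
start_cpu = time.process_time()    # user + system CPU time of this process

total = sum(i * i for i in range(10**6))  # CPU-bound work: counts toward both
time.sleep(0.2)                           # waiting: counts toward real time only

real = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"real: {real:.3f} s, CPU: {cpu:.3f} s")
```

Because of the sleep, the reported real time should exceed the CPU time by roughly 0.2 seconds, mirroring how wall-clock time includes input/output and waiting while CPU time does not.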
Often the benchmark can and should be configured to run the workload multiple times to obtain statistically more stable results. Then, from these basic metrics, a median, mean, standard deviation, and percentiles can be computed. Some benchmarks, especially those for just-in-time compiled runtimes, include warmup rounds that are excluded from the results.
Configuring the benchmark for multiple executions is crucial when one execution takes too little time, such as a few milliseconds. Short executions would be dominated by measurement noise coming from operating system process switches, imprecise timers, and cache initialization. Note that the benchmark itself should support this configuration; simply running the whole program separately multiple times typically does not solve this problem.
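The scheme above (warmup rounds, repeated execution, summary statistics) can be sketched with Python's standard library; the workload is a placeholder for the measured operation:

```python
import statistics
import timeit

def workload():
    return sorted(range(1000, 0, -1))  # stand-in for the measured operation

# Warmup rounds, excluded from the results (most relevant for JIT runtimes).
for _ in range(3):
    workload()

# Each sample aggregates 1000 executions, so a single short run
# is not dominated by timer imprecision and OS noise.
samples = timeit.repeat(workload, number=1000, repeat=7)
per_run = [s / 1000 for s in samples]  # seconds per single execution

print(f"median: {statistics.median(per_run):.2e} s")
print(f"mean:   {statistics.mean(per_run):.2e} s")
print(f"stdev:  {statistics.stdev(per_run):.2e} s")
```

Reporting the median alongside the standard deviation, as here, makes the stability of the measurement visible to the reader.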
As specific examples of papers describing the development of a new benchmark, we can mention the DaCapo benchmark suite for Java performance (Blackburn et al., 2006) and BBEH for natural language reasoning tasks (Kazemi et al., 2025).
7.2 Archival Data Analysis
Instead of performing research studies directly with human participants, we can resort to the analysis of existing artifacts that people produced without intending to participate in research. This is often less resource-intensive and offers a look into natural settings, at the price of giving up full experimental control. Similarly to studies performed with humans, we should not forget to anonymize any personal information that we encounter.
Archival data analysis is applicable in numerous computer science subfields and has many different forms. We selected two specific examples: mining software repositories (MSR) and the analysis of discussion forums.
7.2.1 Software Repository Mining
Software repository mining is a method in the area of software engineering that aims to answer research questions by systematic analysis of data present in version control systems, issue trackers, mailing lists, and similar systems. It is useful to answer a wide range of questions, from “How to predict which commit introduces a bug?” (X. Yang et al., 2015) to “Do programmers work at night or during weekends?” (Claes et al., 2018).
After posing research question(s), we choose a repository or a set of repositories that contains data suitable for answering them. This is often a large open-source project hosting website such as GitHub, which offers integrated issue tracking and continuous integration features, but it can also be a collection of smaller, organization-specific repositories (e.g., Apache or Mozilla) interconnected with external systems. Unfortunately, MSR tends to have a deficit of industrial studies due to intellectual property problems.
Then, we clearly specify inclusion and exclusion criteria. For instance, we can limit the repositories to certain programming languages, libraries used, last update dates, and other criteria according to the purpose of the study. Sampling (Section 6.2.1) may be necessary if the set of projects is too large. If using GitHub, excluding noise in the form of toy projects, personal backups, or homework assignments is also advised (Munaiah et al., 2017).
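A minimal sketch of applying such inclusion and exclusion criteria, over hypothetical repository metadata (the field names are illustrative and do not follow any hosting site's actual API schema):

```python
from datetime import date

# Hypothetical repository metadata records.
repos = [
    {"name": "libfoo", "language": "Python", "stars": 420,
     "commits": 1800, "last_update": date(2024, 11, 2)},
    {"name": "homework-3", "language": "Python", "stars": 0,
     "commits": 4, "last_update": date(2021, 5, 1)},
    {"name": "barjs", "language": "JavaScript", "stars": 95,
     "commits": 900, "last_update": date(2024, 8, 15)},
]

def include(repo):
    """Inclusion criteria: Python projects updated since 2023."""
    return repo["language"] == "Python" and repo["last_update"] >= date(2023, 1, 1)

def exclude(repo):
    """Exclusion criteria: likely toy projects (very few commits or stars)."""
    return repo["commits"] < 10 or repo["stars"] < 5

sample = [r for r in repos if include(r) and not exclude(r)]
print([r["name"] for r in sample])  # → ['libfoo']
```

In a real study, the criteria themselves would be justified by the research questions and reported in the paper so that the selection is reproducible.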
Data downloading and extraction follows. We recommend reserving enough time for this phase, as software hosting sites often impose API rate limitations. The downloaded data should be saved in standard, platform-independent formats. A subset of data necessary for the purpose of the study is extracted and processed. The nature of processing depends on the nature of the study and can include descriptive statistics, correlational analysis, machine learning, and even manual analysis of parts of the data.
Vidoni (2022) provides a systematic review of MSR studies, along with a generic description of the process most studies use. Kalliamvakou et al. (2014) list nine facts that should be taken into account when doing an MSR study using GitHub to prevent significant threats to validity. For instance, many projects are inactive, have very few commits, or do not use GitHub’s pull request feature.
7.2.2 Analysis of Discussion Forums
Another common method utilizing data from humans without involving them in a research study is the analysis of Internet forums, Q&A (questions and answers) websites, and social networks.
A typical method begins with posing research questions and selecting appropriate websites. Next, we decide which kinds of data we need: posts, likes/votes, creation dates, authorship relationships, etc. We continue with sampling of the given posts, using inclusion and exclusion criteria that we define. After data extraction, it is necessary to preprocess the data, since the text of posts, in particular, often contains formatting marks or has character encoding issues. The analysis can be performed in a quantitative, qualitative, or mixed methods manner. For instance, the text of the posts can be qualitatively coded, similarly to the case of interviews (Section 6.3.1). A social network can be modeled as a directed or undirected graph and quantitatively analyzed using graph algorithms (Tabassum et al., 2018).
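As a minimal sketch of such a quantitative graph analysis, hypothetical “who answered whom” pairs extracted from a Q&A site can be modeled as a directed graph, with degree counts standing in for more elaborate graph algorithms:

```python
from collections import Counter

# Hypothetical (answerer, asker) pairs: edges of a directed graph.
edges = [
    ("alice", "bob"), ("alice", "carol"), ("dave", "bob"),
    ("bob", "carol"), ("alice", "dave"),
]

out_degree = Counter(src for src, _ in edges)  # answers written per user
in_degree = Counter(dst for _, dst in edges)   # answers received per user

answerer, count = out_degree.most_common(1)[0]
print(f"most active answerer: {answerer} ({count} answers)")  # → alice (3)
```

Real studies would build the edge list from extracted post and authorship data and typically apply centrality or community detection algorithms rather than plain degrees.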
In the area of software engineering, hundreds of papers using the Stack Exchange network, particularly the Stack Overflow Q&A website, have been published. They range from a general analysis of topics – “What topics are mainly discussed on Stack Overflow?” (Barua et al., 2014) – to a comparison of the quality of ChatGPT answers with Stack Overflow (Kabir et al., 2024). The analysis of Q&A websites is often combined with software repository mining (D. Yang et al., 2017).
7.3 Simulation
Computer simulation is a method consisting of the development of a mathematical model and its subsequent execution to derive results. In other words, it is an approximate execution of a process in a contrived setting.
Simulation is particularly useful when the real executions would be prohibitively expensive, time-consuming, unethical, or difficult to observe. It can be used to answer many “what-if” RQs, such as “What if X happened?” or “How does X probably behave under condition Y?”
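A minimal Monte Carlo sketch of such a “what-if” question, here a hypothetical one: how does the availability of a service change if we add redundant replicas, assuming each replica fails independently with a fixed probability? All parameters are illustrative.

```python
import random

def simulate_availability(replicas, p_fail, trials=100_000, seed=42):
    """Estimate the probability that at least one replica is up,
    assuming independent failures with probability p_fail each."""
    rng = random.Random(seed)  # fixed seed for repeatability
    up = sum(
        any(rng.random() >= p_fail for _ in range(replicas))
        for _ in range(trials)
    )
    return up / trials

for replicas in (1, 2, 3):
    est = simulate_availability(replicas, p_fail=0.1)
    print(f"{replicas} replica(s): availability ≈ {est:.3f}")
```

This particular model is simple enough to solve analytically (1 − 0.1^k), which is a useful sanity check; simulation becomes indispensable once the model includes dependencies or dynamics that resist closed-form analysis.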
Computer simulation is used in a multitude of research fields; human-computer interaction is among the more interesting ones. Some papers focus on specific applications, such as the one by Dubey et al. (2021), which describes a simulation model for wayfinding using signage in 3D space. On the other hand, Park et al. (2023) devised a general simulation of a village with 25 agents performing human-like behavior.
7.4 Proofs
Finally, we briefly mention mathematical proofs, pertaining to non-empirical methods, for completeness. By a proof, we understand a rigorous, logical sequence of mathematical statements showing that a certain statement is true.
Generally, the research questions that proofs aim to answer are of type “Does X exist?” or “Does X hold for all Ys?” Formal proofs are heavily used in theoretical computer science. Other areas relying on proofs are programming language theory, cryptography, and algorithm design.
Proofs are useful alone, but sometimes they are combined with empirical methods. For instance, when proposing an algorithm, its time and space complexity can be proved mathematically and then measured empirically on a range of inputs to show how it behaves in reality, taking hardware peculiarities into account.
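The empirical side of such a combination can be sketched as follows: timing a comparison sort (proved to be O(n log n)) at doubling input sizes and inspecting the ratios, which for O(n log n) should come out slightly above 2. The sizes are illustrative, and timing-based ratios remain noisy on real hardware.

```python
import random
import time

def measure_sort(n, rng):
    """Time sorting n random floats once."""
    data = [rng.random() for _ in range(n)]
    start = time.perf_counter()
    sorted(data)  # proved O(n log n) comparison sort
    return time.perf_counter() - start

rng = random.Random(0)
sizes = [50_000, 100_000, 200_000, 400_000]
times = [measure_sort(n, rng) for n in sizes]
for n, prev, cur in zip(sizes[1:], times, times[1:]):
    # For O(n log n), doubling n gives a ratio slightly above 2.
    print(f"n={n:>7}: time ratio vs. n/2 = {cur / prev:.2f}")
```

A single run per size, as here, only gives a rough picture; the repeated-measurement practices from Section 7.1.2 apply equally to this kind of empirical check.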
Lamport (2012) provides useful tips on how to write more readable and understandable proofs. According to him, we should break them into hierarchical structures, with high-level proof sketches first and details deferred to a later part. A machine-checkable version should also be supplied if possible.
Exercises
- Discuss a real or potential situation where the researchers in some computer science subfield started concentrating too much on beating a benchmark, while factors limiting the application in practice were neglected.
- Find at least two papers that use the same benchmark in their evaluation. Was the benchmark itself published in another paper? Are the evaluation results directly comparable, or are there discrepancies (e.g., different metrics or conditions)?
- Find a study that mines software repositories or analyzes discussion forums. What were the inclusion and exclusion criteria for repositories/posts? Were all matching repositories/posts analyzed or was sampling used?
- Suppose you prove that an algorithm has an average-case time complexity of \(O(n^3)\), but after implementing it, executing on real data, and plotting the results, you find out the chart seems more like \(O(n^2)\). What could be the reasons?