Table of Contents
- What is the Digital Standard?
- Who Created and Maintains the Digital Standard? Who can Contribute?
- Why is Testing Important?
- Why was this Testing Handbook Necessary, and Who is it For?
- How does the Handbook Score Products?
- How did we Pick the Products? (And Why aren’t We Naming Them?)
- What Products did we Ultimately Choose?
- How did we Design the Technical Testing Procedures?
- How did we Design the Policy Testing Procedures?
- What would we Change in the Standard?
- Conclusion
How did we Design the Technical Testing Procedures?
As we prepared the testing handbook, we realized that some tests included in the Standard are much more technically complicated than others. Many of the technical tests require an in-depth understanding of mobile application development and digital security principles, as well as a comprehensive understanding of the features of a particular product. These tests also require setting up custom testing environments to capture certain elements of a product's behavior, along with specialized software for analyzing code or inspecting captured network traffic. As a result, our processes suggest background reading on related topics and software in several places, and some assume a higher level of expertise. While we endeavored to make the handbook readable for the non-expert, there is no getting around the fact that complex tests require highly technical documentation.
In developing the processes for the technical tests, one key area of focus was ensuring that a wide variety of stakeholders could conduct testing, even with limited budgets. This meant limiting the scope of certain procedures; even where it was technically possible to dig deeper, we remained sensitive to cost. As researchers at a non-profit, we realized that others interested in similar product evaluation may have limited resources to spend on testing. We found that much of the testing required by the Digital Standard can be done without building out a lab full of specialized equipment. The vast majority of the testing processes we describe can be conducted on an average laptop running free and open source tools.
Our testing handbook does require some dedicated hardware for testing—like a wifi network on which you are allowed to inspect all traffic—in addition, of course, to the products being tested. In most cases, a mobile device is required for testing interaction with the product, though depending on how an app is built, it may be possible to run some of the tests using virtual machine emulation instead of a physical device. Of the three products we tested in developing this handbook, only one had good support for emulation.
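To illustrate what inspecting captured traffic can look like in practice, the sketch below parses a classic pcap capture file using only the Python standard library and counts the IPv4 destinations a device contacted. It assumes Ethernet link-layer framing; captures from some wifi monitoring setups use other framing (such as radiotap) and would need different handling. The function name is ours, not part of any tool the handbook prescribes.

```python
import struct
from collections import Counter

def destinations(pcap_bytes):
    """Count IPv4 destination addresses in a classic pcap capture.

    Assumes Ethernet link-layer framing; a capture taken in wifi
    monitor mode may need radiotap handling instead.
    """
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":      # little-endian pcap
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":    # big-endian pcap
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    counts = Counter()
    offset = 24                           # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        # Each record header: ts_sec, ts_usec, incl_len, orig_len (4 bytes each).
        incl_len = struct.unpack(endian + "I", pcap_bytes[offset + 8:offset + 12])[0]
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        # Ethernet header is 14 bytes; ethertype 0x0800 marks IPv4.
        if len(frame) >= 34 and frame[12:14] == b"\x08\x00":
            # Destination address sits at bytes 16-19 of the IP header.
            dst = ".".join(str(b) for b in frame[30:34])
            counts[dst] += 1
    return counts
```

In practice a tester would run a capture tool on the dedicated wifi network, save the traffic to a file, and then ask questions like this of the result; dedicated analysis software offers far more, but the underlying data is this accessible.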
In some cases, we realized that more information could be obtained if we bought more expensive and specialized testing equipment. For example, while it may be possible to detect activity on a running embedded chip using special clamps designed to read electrical impulses going to the chip’s pins, that kind of procedure seemed out of scope for an achievable and reproducible testing process. We noted these kinds of limitations in our results.
A similar limitation applies to all of the technical tests: product testers who can apply more resources will need to expand their procedures beyond those covered in our handbook. This includes spending more time on things like background research, traffic analysis, and deeper app decompilation efforts, as well as spending more money on equipment. We intend this handbook to be a floor rather than a ceiling, and hope that future testers will expand upon our processes by building up their testing gear or dedicating more time to a particular procedure.
Writing a standardized technical test presented another unique challenge. The speed at which technology changes creates a risk of testing against outdated best practices. Old best practices are constantly being swapped out for new ones, and some best practices remain contentious. For example, recommendations about password strength have steadily moved toward increased password complexity. But how far this should go (and at what point human memory is overtaxed) is still an active debate among security professionals. These are instances where a tester has to make a qualitative call on current best practices. We noted where we made such calls in our results.
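To make the idea of such a qualitative call concrete, the sketch below shows one way a tester might encode a password-strength judgment as a runnable check. The minimum length, the two-character-class threshold, and the tiny blocklist are all illustrative assumptions a tester would need to justify against current guidance; they are not values endorsed by the Standard.

```python
# Illustrative only: these thresholds and this blocklist are assumptions,
# not settled best practice, and would need revisiting as guidance evolves.
COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein"}

def password_acceptable(password, min_length=12):
    """Minimal sketch of a password-strength check a tester might apply."""
    if len(password) < min_length:
        return False
    if password.lower() in COMMON_PASSWORDS:
        return False
    # Require at least two character classes as a rough complexity proxy.
    classes = sum([
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ])
    return classes >= 2
```

The point is not the specific rules but that every such rule embeds a judgment call: another tester could defensibly choose a longer minimum, drop the character-class requirement, or use a much larger blocklist.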
The Digital Standard’s Product Stability test is an example of both the need for qualitative analysis and the danger of documentation growing stale. This test centers on software “fuzzing,” the common testing practice of programmatically feeding known bad inputs into every place within a piece of software that accepts input. These tests run in loops for days or more, to see under what conditions the code might crash. Fuzzing is itself far from a standardized process, and typically requires deep research into the specific code base being tested. Given the detailed per-product knowledge required, and the lack of accepted benchmarks for fuzzing, it is the only section of our testing handbook that is qualitative rather than pass/fail. Our handbook describes one approach to fuzzing. However, new research suggests that this process can be automated to a greater degree, and that future fuzzing efforts may well require less specific knowledge of an application's functionality. That would also make it easier to compare results across devices and platforms, and make the process less time intensive.
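As an illustration of the basic mechanics, the sketch below implements a minimal mutation-based fuzzing loop. The target function, the byte-flipping mutation strategy, and the iteration count are all illustrative assumptions; real campaigns run far longer, generate structured inputs, and track code coverage rather than just crashes.

```python
import random

def mutate(seed, rng):
    """Flip a few random bytes of a seed input."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)

def fuzz(target, seed, iterations=1000, rng=None):
    """Feed mutated inputs to `target` and collect inputs that raise.

    `target` stands in for whatever input-handling routine is under
    test; a fixed random seed keeps the run reproducible.
    """
    rng = rng or random.Random(0)
    crashes = []
    for _ in range(iterations):
        case = mutate(seed, rng)
        try:
            target(case)
        except Exception:
            crashes.append(case)
    return crashes
```

Even this toy loop shows why fuzzing resists standardization: the choice of seed inputs, mutation strategy, and what counts as a "crash" all depend on the specific product under test.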
We did not limit our focus on achievability to the cost of equipment. There were tests we knew were technically possible but prohibitively difficult in some other way. In this sense, achievability meant purposefully leaving some possible examinations out of the handbook. To use a previous example, it may be possible to extract microcontroller code from running chips with special equipment; however, in addition to that equipment costing money, it may be useful only for a limited subset of products, and purchasing it doesn't guarantee that the code can be meaningfully accessed. For third-party testers, it is generally much harder to obtain access to those parts of a product's running code than it is to get to other parts, like an app. With no guarantee of obtaining code that can be analyzed, it did not seem worthwhile to spend resources on procedures with an uncertain outcome.
We also decided not to create processes to test the interaction between devices and "home assistant" technologies. Smart home assistant products present a whole new class of attack surface. They are also mostly devices that run on closed platforms, using software controlled by the assistant's manufacturer rather than the manufacturer of the IoT products being tested. This makes it harder to fully test or understand product/assistant interaction. While there is some standardization of the communication between IoT devices and home assistants, testers would generally be unable to fully control the test environment of such devices. There is also a relative difficulty in setting up open source implementations of home assistant technology, and in knowing whether those implementations are feature complete compared with more common home assistant platforms. As the landscape of IoT and home assistants changes, it may become more feasible to test the interaction between consumer devices and assistants. For the time being, however, we felt that in addition to technical complexity, testing with home assistants could introduce uncertainty into the results.
Finally, we chose to focus on testing apps built for Android devices even though all three of the products we tested have both Android and iOS versions of their apps. We made this choice for a number of reasons, but at the top of the list is the size of the Android user base. While Apple products enjoy wide popularity, far more people are using apps within the Android ecosystem. On average it is also cheaper to purchase testing equipment for Android devices, and as noted above, we endeavored to keep down the costs of our testing. The other top reasons are closely related to Android's open source nature: the tool sets are all freely available, and there is a much larger community of developers and researchers working with the platform. We did not make this choice because we think apps for Apple's iOS aren't worth testing. Rather, we wanted to focus deeply on the questions asked by the indicators, and we were able to build on the team's existing Android knowledge and the wide array of available documentation on Android systems.
Given all of these factors, the broad range of IoT devices and features, and an understanding that other possible testers may have time constraints, we made decisions in our approach to technical testing procedures that allowed us to prioritize achievability. Often this resulted in our describing the simplest test that would produce a clear result.