Note: This is Part 2 of 2. Part 1 can be found here
Overall findings of the Scanners, in alphabetical order
Pros: Acunetix was a close third behind Appscan after being trained to find every link.
Cons: Acunetix missed 53% of the vulnerabilities even after being trained to know all of the pages. As mentioned previously, on their own test site, Acunetix missed 31% of the vulnerabilities after training and 37% without training. This is a significant cause for concern, as they should be aware of the links and vulnerabilities on their own site and be able to crawl and attack them. These test sites are relatively small; on any site that cannot be completely crawled manually, testers should be wary of relying exclusively on Acunetix given the weakness of its crawler.
Support: The staff at Acunetix is very responsive and was helpful with keeping their test sites up and resetting them as needed. When help was needed to understand how to best train the scanner using manual crawling, they promptly provided clear documentation on how to use the various included tools to accomplish the task.
Review: Acunetix lagged the industry leaders in point and shoot mode, giving rise to concerns about running it without significant training. If it is trained to find every link, it is a close third to Appscan.
Pros: A high quality scanner with acceptable results on most sites. It performed well in ‘point and shoot’, better than all the scanners except NTOSpider.
Support: Appscan is a scanner I use regularly, so it required little support for the study.
Review: Appscan is a solid and seasoned scanning tool, and while it did not top the study, it always delivers consistent and reliable results. In Point and Shoot scanning, it came in as the clear second-place solution.
Pros: As a manual pen-testing tool, it is top rated. At its price point, it's hard to argue against having it in your toolkit.
Support: No official support available
Review: BurpSuite is well recognized as a best of class hacking proxy. It is a useful companion to a full commercial tool. (Note: NTOSpider has recently added integration with BurpSuite to allow users to manually dig deeper into automated findings)
Pros: The Cenzic web application security scanner has some very positive benefits to the web application security tester. It has great accuracy when trained effectively. Most of its attacks are highly customizable and configurable. It has a form training module that allows for fine grained control over the types of parameters that are submitted and can recollect these for later use. It has a modular architecture which allows different types of spidering and manual traversals to be combined with highly customized attacks. It was definitely the scanner with the most configuration options that could actually make a difference in the outcome of an assessment.
Cons: Hailstorm missed 38% of the vulnerabilities even after being trained to know all of the pages, and missed 60% untrained. The Cenzic scanner was the most challenging to configure for effective basic scans. It took 2-3 times longer to train for an effective scan than most other scanners in its class. The Cenzic scanner is definitely geared for use by the more seasoned pen tester, as shown by its numbers in the point and shoot category.
Support: The staff has been very helpful through the entire process, including answering calls.
Review: In general, the observations of Hailstorm show that scanning even well understood and simple web applications requires a fairly thorough understanding of the scanner. Human intervention is frequently required to get satisfactory results.
Pros: NTOSpider was the most accurate scanner, finding over twice as many vulnerabilities as the average competitor. Even without training, it was able to discover 92% of the vulnerabilities, compared to the closest competitor, which was only able to find 55%. Once trained, it increased to 94%, compared to the closest competitor, which was only able to find 62%. It is great for fully automated scans, and now has a better interface and manual training support.
Cons: Still needs work on the manual training features and possibly with scan times.
Support: The staff at NT OBJECTives was very helpful and responsive during the course of this study. Given their single focus, it is fairly easy to get support from the employees who work on the technology, as opposed to navigating a help desk.
Review: As clearly the leader in terms of quality results, NTOSpider performed very well. The results make a great case for using NTOSpider as the first choice for automated scanning.
Note: For the purpose of clarity, it needs to be pointed out that the Qualys testing was done in a different manner than the other tools. See Methodology for details.
Pros: Because Qualys is a service, it is the ultimate point and shoot. You place your order and they deliver a report.
Support: Given that all that was done was to order the scans and download the results, there is no comment on Qualys’ support.
Pros: The interface for reviewing the scan data is very well designed.
Cons: Poor vulnerability-finding results, with the worst score in this review. WebInspect missed 66% of the vulnerabilities, even after being trained to know all of the pages. It missed 42% of the vulnerabilities on its own test site after being trained and 55% before training. The manual training features are overly complicated, and it took a number of hours to learn how to do simple tasks. During the testing, numerous scans crashed or hung, which caused delays. All of these issues point to significant problems with maintaining quality since HP's acquisition of SPI Dynamics.
Support: Difficult to reach anyone. Required help from colleagues and acquaintances to get questions answered.
Review: The apparent problems were very surprising for the industry market share leader. Many enterprises have been using WebInspect for years. These results bring into serious question its abilities to find the latest vulnerabilities in modern websites; users of this tool should seriously consider re-evaluating their reliance on it as a method for securing their web applications.
The scanning vendors have spent a significant amount of time discovering a range of web application vulnerabilities, both through independent research and by getting information from customers. As a whole, these vendor websites create a meaningful testbed to evaluate the effectiveness of web application scanners. Some vendors will take the view that this is not an optimal way of looking at things, but this is a valid baseline with well understood vulnerabilities, and the results can be validated fairly straightforwardly.
Some readers of this study may ask why scans were not performed against some of the web applications created for teaching purposes (e.g. WebGoat and Hacme Bank). First, these were not designed to mimic the functionality of real web applications but are intended for use in teaching a human how to perform an audit. The vendor test sites are more representative of the types of behaviors scanners would see in the wild. Second, some of the vendors are aware that users test against these sites and have pre-programmed their tools to report the vulnerabilities that they have already discovered. It is sort of like getting a copy of the test beforehand and memorizing that the answers are d,c,b,a, etc. as opposed to learning the material. The scanner may discover vulnerabilities on these sites, but this has no predictive value for how it will perform for a user in testing their own sites.
I would also like to discuss this study in light of how it relates to a normal scanner evaluation. Web scanners will obviously have different results on different websites. For this reason, it is important to test the scanners against a range of websites with different technologies and vulnerabilities. Although NTOSpider was always at or near the top, results varied greatly by web application. In order to eliminate the effects of luck with small sample sizes, I decided to have at least 100 vulnerabilities in this test. Roughly 120 hours of work, plus access to all the scanners and experts in each to help, was put into this study, which may not be an option for many enterprises. Having said that, evaluating these tools on a small sample size of vulnerabilities can be a bit of a crap shoot. This is not to say that evaluators should not try the tools in their evaluations. But their results should be considered along with industry studies. One can get a sense of the feel of the tool in an evaluation – accuracy requires a larger investment of time. This is analogous to buying a car – you might get the feel of the vehicle from driving it but you should rely on Consumer Reports for certain things that may not be apparent during the test drive such as how well the engine performs over time (and certainly the crash test results).
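To illustrate why a small number of vulnerabilities makes an evaluation unreliable, here is a minimal Python sketch. It is not part of the study's methodology; it assumes each vulnerability is an independent found/missed trial and uses the standard Wilson score interval to show how much uncertainty surrounds the same observed detection rate at two hypothetical sample sizes. The function name and the example counts are illustrative only.

```python
import math

def wilson_interval(found, total, z=1.96):
    """Approximate 95% Wilson score interval for a detection rate found/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = found / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical evaluations: the same 55% observed detection rate
# measured on 20 vulnerabilities and on 100 vulnerabilities.
for found, total in [(11, 20), (55, 100)]:
    low, high = wilson_interval(found, total)
    print(f"{found}/{total} found: plausible true rate {low:.0%} to {high:.0%}")
```

Under these assumptions, the interval for the 20-vulnerability evaluation is considerably wider than the one for 100 vulnerabilities, so two scanners with quite different true accuracy could look similar, or swap places, in a small in-house test.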
The results of this study will be surprising to many. Even when web application scanners are directed to the vulnerable pages of a website, there is a significant discrepancy in the number of findings. Again, these results should not be surprising given the great difficulty of achieving accurate results over an infinite target space of custom web applications. This is a much harder problem than network scanning. These results should give security professionals significant reason for concern if they are relying on one of the less accurate tools. There is a good chance that they are missing a significant number of vulnerabilities. The vulnerability results, together with the analysis of the time/cost involved in False Positive and False Negative findings, should highlight additional areas of interest and consideration when picking a scanner. Given the large number of vulnerabilities missed by tools even when fully trained (56% when NTOSpider is eliminated from the results), it is clear that accuracy should still be the primary focus of security teams looking to acquire a tool.
The numerous crashes that I experienced with Appscan and WebInspect are also an issue that should be considered. As mentioned earlier, these are relatively small sites. The risk of a crash preventing completion of a scan will increase significantly with larger scans. The data speaks for itself, and I was surprised that my previous report was largely validated by what I saw during this analysis. I was impressed by the results of NTOSpider, with an excellent rate of vulnerability discovery, low false positives and terrific automation. For manual auditing, I was very impressed with BurpSuitePro, which at roughly $200 is clearly a worthy tool to have in my toolkit. The biggest disappointment had to be HP WebInspect, which performed below my expectations. These results showed that it is not the size of the marketing budget that produces a better product. Scanners with big deltas between trained and untrained results (Hailstorm, BurpSuitePro and Acunetix) can provide good results, but may require more effort to achieve them.
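The trained-versus-untrained delta can be made concrete with figures quoted in the scanner summaries. The following is a minimal Python sketch, not part of the study's tooling; it uses only the overall percentages stated in this article for Hailstorm (missed 60% untrained, 38% trained) and NTOSpider (found 92% untrained, 94% trained), and the variable names are illustrative.

```python
# Percent of vulnerabilities found, as (untrained, trained).
# Hailstorm's figures are derived from the miss rates quoted in this article
# (missed 60% untrained, 38% trained); NTOSpider's are quoted directly.
found_pct = {
    "Hailstorm": (100 - 60, 100 - 38),
    "NTOSpider": (92, 94),
}

# Delta = percentage points gained by manually training the scanner.
deltas = {
    name: trained - untrained
    for name, (untrained, trained) in found_pct.items()
}

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{delta} points after training")
```

A large delta indicates that the scanner's results depend heavily on the effort and skill of the person training it, while a small delta indicates that point and shoot results are already close to what the tool can achieve.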
Response to my 2007 Study
In October 2007, I published a study, “Analyzing the Effectiveness and Coverage of Web Application Security Scanners”, in which I compared 3 commercial web application scanners: Appscan, NTOSpider and WebInspect. The scanners were deployed in a ‘Point and Shoot’ method (i.e. I relied on their crawlers and did not point them to any areas of the websites being tested). I reported results for crawled links, application code functionality exercised (as monitored by Fortify Tracer) and vulnerability findings (both verified positive and false negatives). The results, as summarized in Appendix 2, showed that NTOSpider had far better coverage and vulnerability assessment than both Appscan and WebInspect. I believe that the findings demonstrated that because of the nature of web applications, there can be a wide divergence in scanning results based on the quality of the scanner and/or specific functionality employed by the web application being scanned. Web application scanning is a much more difficult task than network scanning because most web applications are custom and scanners must crawl and attack them like a human, as opposed to searching for signatures, as network scanners do.
There was a significant amount of criticism of the results. After discussing the 2007 paper with numerous security professionals, I believe that the paper highlighted a significant fault line within the security community. Broadly speaking, there are two groups in the web application testing community.
Group 1: Uses scanners in a more or less ‘point and shoot’ manner and relies on the scanners’ crawler and automation to exercise the site’s functionality with minimal or no human guidance. Their reasons for this include 1) they lack the time to spend training the scanner, 2) they want a repeatable result for audit purposes that is separate from the skill of a particular tester and 3) they believe that point and shoot results are sufficient to achieve the level of security testing on websites of the complexity that they are testing.
Group 2: Believes that scanning in a point and shoot manner is insufficient. They feel that given the complexity of modern websites, no automated tool can do an adequate job of testing a website without substantial human guidance. They often believe that scanners should be an adjunct to human testing and should be used to run a large number of easy attacks to get easy to find vulnerabilities (“low hanging fruit”) and that human testers are required to get more difficult to find vulnerabilities. Members of Group 2 were the strongest critics of my original study. Without opening up this can of worms again, I think that it is important to note that it is, in a sense, a pointless debate because regardless of the merits of either side, testers are going to fall into Group 1 or Group 2 or somewhere in the middle depending on their needs and skill sets. The point of this follow-up study is to address a criticism of Group 2. Group 2 argued that the 2007 study was not useful because I did not train the scanners (i.e. walk them through the websites that I scanned). If I had done this they claim that my results would have been different. This is certainly theoretically possible and was part of the impetus behind this second study.
The full study with appendices and graphs can be downloaded here.
Larry Suto can be reached using his name with a dot in the middle and sending it to gmail.