Testing jswzl at scale

Charlie Eriksen
Founder

January 4, 2024

JavaScript has a reputation for being quirky and inconsistent. People do weird things with it, and not all parsers and engines behave exactly the same.

As I began improving the test coverage for jswzl, I faced a daunting challenge: ensuring that the pre-analysis, optimization, and analysis engine of jswzl could be relied upon to work with all the funky JavaScript code that people write and serve on the internet. The task seemed overwhelming, but I was determined to validate the engine's reliability in the wild west that is the world of JavaScript.

This post gives an overview of some of the more interesting efforts it has taken to test jswzl, and the challenges faced along the way.

Step 0 - Unit testing

Of course, unit testing of the individual components is the baseline. Adding new test cases is trivial and gets done whenever a regression or issue is found.

But unit testing only takes you so far. When it comes to parsing and manipulating Abstract Syntax Trees (ASTs) recursively with many transformers, weird things can happen when everything is put together. 

Step 1 - Manually annotated test corpus

With unit tests alone, we get no coverage of the interaction between components. The approach to address this was to create a custom wrapper around the entire analysis engine and run code through it, which catches issues where parsing outright breaks.

But we want the engine to not only parse the code, but also produce analysis results that are as correct as possible. The simple solution: using comments in a JS file, one can annotate the code with markers for the results expected for a given expression. Normally these would be larger JS files, but for demonstration purposes, here's an example I added a few days ago when I found a bug in the Call Pattern logic.

Simple example: the first line should match a require statement, but not the second one, as indicated by the exclamation mark.
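To give a feel for the idea, an annotated corpus file might look roughly like this. The @expect marker syntax below is made up purely for illustration; the real corpus uses jswzl's own annotation format:

    // Hypothetical marker syntax, for illustration only.
    const mock = { require: (id) => ({ id }) };

    const fs = require("fs");        // @expect: RequireCall
    const fake = mock.require("fs"); // @expect!: RequireCall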

After parsing the code, all comments are collected and parsed to work as range markers. An AST walker can then walk the tree with these annotations and verify whether the range contains the expected result type(s).
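As a rough sketch of that mechanism in JavaScript, using acorn and acorn-walk (jswzl itself is not built on acorn, and its real matching rules are far more involved), the hypothetical @expect markers from above could be collected and checked like this:

    // Sketch only: collect hypothetical "@expect" comments and check them
    // against findings produced by walking the AST.
    const acorn = require("acorn");
    const walk = require("acorn-walk");

    function runCorpusFile(source) {
      const expectations = [];
      const ast = acorn.parse(source, {
        ecmaVersion: "latest",
        locations: true,
        // acorn hands us every comment along with its location.
        onComment: (_block, text, _start, _end, startLoc) => {
          const m = /@expect(!?):\s*(\w+)/.exec(text);
          if (m) expectations.push({ line: startLoc.line, negated: m[1] === "!", kind: m[2] });
        },
      });

      // Stand-in "analysis": record a RequireCall finding for every require() call.
      const findings = new Map();
      walk.simple(ast, {
        CallExpression(node) {
          if (node.callee.type === "Identifier" && node.callee.name === "require") {
            const line = node.loc.start.line;
            findings.set(line, (findings.get(line) || []).concat("RequireCall"));
          }
        },
      });

      // Verify each annotation against what the walker found on that line.
      for (const { line, negated, kind } of expectations) {
        const found = (findings.get(line) || []).includes(kind);
        if (found === negated) {
          throw new Error(`Line ${line}: expected ${negated ? "no " : ""}${kind}`);
        }
      }
    }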

You now have a way of defining a “source of truth” for what the analysis engine should produce, and all you need to do is annotate a source file, which is made easy with a bit of support from your IDE (Live Templates in Rider, for example):

Live templates in IntelliJ-based IDEs make it a breeze to annotate files.

Step 2 - Crawling the webs

We now have much more confidence in the engine for the set of sources we’ve taken the time to annotate. But the internet is full of weird and wonderful JavaScript, like you wouldn’t believe! So, doing some crawling for research seems like an obvious choice. Not only does it let us find sources that reveal issues in the pipeline, but putting all this data into a single project file also gives us a significant data set of sources with interesting patterns to discover.

And yes, this generates a lot of data to study! So far, I’ve had a project file go up to 220 GB. Yikes. But how? I dusted off my old workstation:

  • Ryzen 5950x (16C/32T)
  • 128GB RAM
  • 1 TB Samsung 970 Evo Plus
  • Mellanox ConnectX-3 10G SFP+

Sitting under my desk, it’s connected to a 10G network with a 1G downlink to the internet. The crawling setup is gowitness proxied through Burp Suite (with the jswzl plugin installed, of course!), feeding the jswzl analysis engine. Now, what do we crawl?

Enter the Cisco Umbrella Popularity List: a daily list of the top 1 million DNS names requested through Cisco’s DNS network. The list is fed through httpx to probe ports 80 and 443, which produces a list of URLs that are alive. Success!

Now, we give gowitness that file to run through. At first, everything would break after minutes or hours, even with just two concurrent threads. But after many rounds of performance optimizations, it now easily handles four concurrent threads for days on end, at least until IO becomes an issue as the project file grows.

Before recent releases, things would slow down dramatically once the project database grew beyond ~5 GB. Now performance only drops off at ~200 GB with this hardware. That’s big, and far bigger than is practical to work with outside of a few very specific use cases.

Step 3 - Analyzing the data

Now we’ve got a lovely 220GB project file. Project files are just plain SQLite. The first step is getting it copied over to another machine for analysis. In this case, I’ve been utilizing my workstation:

  • Ryzen 7950x (16C/32T)
  • 192GB RAM
  • Seagate FireCuda 530 4TB SSD
  • Marvell AQtion 10G

The 10G networking here does a lot of the initial heavy lifting; who's got time for waiting? But as it turns out, the real challenge is reading from a SQLite file that large, and that's where most of the waiting ends up happening anyway. That is worthy of a blog post of its own, given the complexity of scaling SQLite.

At any rate, we have data we can query. What do we look for? Here are some examples of how it has helped:

Analyzing source maps

Initially, jswzl only supported source maps loaded from a map file. Eventually, I had to tackle the problem of inline source maps. But does everybody do source maps the same way?

That question we could simply answer with data! Going through all the analyzed sources, I could extract anything that looked like a source map and categorize it.
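The real categorization happens inside the engine, but the gist can be sketched in a few lines of JavaScript (the property names in the result are my own, chosen for this example):

    // Sketch: find every sourceMappingURL annotation in a source and classify it.
    function categorizeSourceMaps(source) {
      const maps = [];
      const re = /\/\/(#|@)\s*sourceMappingURL=(\S+)/g;
      for (const m of source.matchAll(re)) {
        const url = m[2];
        const restOfFile = source.slice(m.index + m[0].length);
        maps.push({
          marker: m[1],                          // "#" is the modern form, "@" the legacy one
          inline: url.startsWith("data:"),       // inline data: URI vs. external .map file
          atEnd: restOfFile.trim().length === 0, // is the annotation really the last thing in the file?
        });
      }
      return maps;
    }

Run over every stored source, something like this makes it easy to count how common each variant actually is.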

A few interesting things came from that:

  • Many bundles pack multiple files into a single file, each with their own source map. 
  • The specification allows both the //# and //@ annotation, but the //@ form was exceedingly rare. 
  • The source map annotation is supposed to be on the last line of the source, but this is definitely not always the case.  

Having multiple source maps in a single file is especially interesting, as it doesn’t seem to have any basis in the specification.

Validating chunk prefetching

One of the more complex aspects of jswzl is its capability to detect chunked files from Webpack, and pre-fetch them. But Webpack does this in many different ways depending on the version and the target runtime. How about we validate this at scale?

First, we know that Webpack will (usually) output a *runtime.js file, which implements the chunk-fetching logic. This means we can extract all sources from files matching that name pattern.

We can then feed those files into a test corpus on which the engine runs. The results are checked to make sure that at least one chunked file reference is found and that none of the chunked file names contain the word “undefined,” as this suggests that the runtime couldn’t resolve some value. 
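For context, the chunk-name logic in a Webpack runtime typically looks something like the simplified, representative fragment below (not taken from the corpus); if the analysis fails to resolve one of the lookup tables, the concatenation produces the tell-tale “undefined”:

    // Representative, simplified Webpack-style runtime fragment -- not actual corpus code.
    const __webpack_require__ = {}; // stub so the snippet runs standalone

    __webpack_require__.u = (chunkId) =>
      "static/js/" +
      ({ 179: "main", 521: "vendors" }[chunkId] || chunkId) +
      "." +
      { 179: "a1b2c3d4", 521: "e5f6a7b8" }[chunkId] +
      ".chunk.js";

    __webpack_require__.u(179); // "static/js/main.a1b2c3d4.chunk.js"
    __webpack_require__.u(999); // "static/js/999.undefined.chunk.js" -- the failure case the check catches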

Utilizing Verify, it also generates a snapshot of the output, which serves as the expected test result, making it easy to see if a change leads to the engine producing incorrect results.

Many interesting patterns of runtime files were found in the data, all of which present their own unique challenges. Some of them make little sense at a glance. Here are some of my favorite pieces of… questionable code that gave me headaches:

  • Why the empty string literal, and indexing into an empty object?
  • Why is there an assignment in the indexing?
  • This is just a big cascading if statement. Seems weird.
  • Only slightly different from above.
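The snippets below are reconstructions in the spirit of those patterns, not the actual code from the crawled sources:

    // Reconstructed illustrations of the kinds of patterns described above,
    // not the actual code from the crawled sources.
    const r = {};
    let t;

    // An empty string literal, and indexing into an empty object:
    r.u = (e) => "" + e + "." + {}[e] + ".js";

    // An assignment hiding inside the index expression:
    r.u = (e) => ({ 42: "main" }[(t = e)] || t) + ".bundle.js";

    // A big cascading if statement instead of a simple lookup table:
    r.u = (e) => {
      if (e === 42) return "main.js";
      if (e === 7) return "vendors.js";
      return e + ".js";
    };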

Lessons learned? A few:

  • Webpack does weird things quite often.
  • Testing with a large and realistic data set was huge for improving reliability. The first test run had only about a 10% success rate, which was improved to much closer to 90%+.
  • The fact that it’s not 100% is partly because not all runtime files have chunk references embedded in them, and those need to be reviewed manually. 

Challenges faced

This path was fraught with challenges, both good and bad.

SQLite - Balancing IO and memory

The jswzl project files are just SQLite databases. And database optimization is difficult, especially when you do silly things like storing multi-megabyte blobs. Structuring data correctly becomes a real art.

At large project sizes, IO speed becomes crucial very quickly. The alternative is to rely less on IO and keep more in memory. But even then, you can only keep so much in memory if your project file is 50GB+. Most people don’t have more than 16-32GB memory in their machines. 

Another interesting dilemma is compression. Currently, sources are compressed to minimize file size, and thus the number of pages that have to be read from disk per query.

Pros: 

  • The file size is minimized
  • Rows are smaller, which should make query performance better

Cons:

  • For sources that are loaded, a decompression step is necessary, which is a CPU-bound task
  • Slight transient memory increases as the data is (de)compressed.  
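In Node terms, the trade-off looks roughly like the sketch below, with gzip standing in for whatever codec the project file actually uses (an assumption on my part):

    // Sketch of the trade-off; gzip is a stand-in, not necessarily what jswzl uses.
    const zlib = require("node:zlib");

    const source = "console.log('hello');\n".repeat(10000);

    // On write: smaller rows, so fewer SQLite pages to read back later.
    const stored = zlib.gzipSync(Buffer.from(source));

    // On every read: a CPU-bound inflate step, plus a transient buffer
    // roughly the size of the original source.
    const loaded = zlib.gunzipSync(stored).toString();

    console.log(`raw: ${source.length} bytes, stored: ${stored.length} bytes`);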

Burp Suite

Ah, good ol’ Burp Suite. It’s a great tool, but for these purposes, it only serves one role: To proxy the traffic and ship it off to jswzl. This means we want to reduce any overhead from it as much as possible.

This is easier said than done. It required a lot of tweaking and messing around with Burp to make it stable and not eventually crash. It ended up being a combination of:

  • Using a file-backed project, which was counter-intuitive
  • Disabling logging of out-of-scope requests
  • Disabling all active and passive tasks
  • Disabling the logger

Even then, you still see some weird behavior regarding memory allocation, which is more of an artifact of how the JVM and Windows handle virtual memory, only releasing pages as memory pressure occurs. 

Next steps?

As mentioned, working with these large datasets is barely practical or helpful for most people. But if there’s one thing in software that is always exciting to ask and answer, it’s the question: How far can we take it? How fast/efficient can we make it? 

It’s pure hubris, and more a self-indulgent exploration of what is possible than anything else. But there’s one big question that arises from this:
What would happen if one used a different database engine, like PostgreSQL, instead of SQLite? Would it allow the system to scale further? The database is currently the most significant limiting factor in scaling, and SQLite only scales so far in this use case. But could a more fully-fledged engine do better?