github twitter linkedin email rss
Node.js Streams applied - node-dirbuster case
Apr 18, 2015
12 minutes read

When starting a new penetration test, one of the first things a security specialist would have to do is identifying and diagnosing the target. This process consists in several substeps, from enumeration of hosts, technologies used, type of application or service, its users, business logic, spelunk open sources of intelligence and many others. This is known as recon.

One of the tools we can use for recon is DirBuster, an Web Application endpoint enumeration tool, now part of the OWASP ZAP Attack Proxy.

DirBuster is a multi threaded java application designed to brute force directories and files names on web/application servers.

In essence, what DirBuster does is grabbing a list of words, “throws” them at a Web Server as possible paths and sees the response to conclude if there is a resource available for each that path. Using this strategy, it is common to find URLs that weren’t linked (and therefore not caught by a crawler), that can contain admin pages, old service endpoints with bad Auth, etc, exposing interesting information to the pentester. A simple visual representation for what DirBuster does is:

┌─────────┐   ┌─────────┐   ┌─────────┐
│ List of │   │ Attempt │   │ Output  │
│  words  ├──>│ request │──>│ finds   │
└─────────┘   └─────────┘   └─────────┘

If you have been doing Node.js for a while, you might be thinking ‘Oh, this looks like a very good use case for Node.js stream’ or at least that is what I thought :)! A dirbuster implementation in Node.js could levarage Node.js streams, and in fact, dirbuster has some interesting stream scenarios that go out of the Node.js core streams scope. As a note, my intention to build a Node.js dirbuster implementation had no specific goal of replacing the original dirbuster, instead, I wanted to know how natural would it be to use streams for this use case and more importantly, create a show case of different stream plumbing options we have available. I thank you in advance the time you are taking to read this blog post, I hope it helps in some way, any feedbacks or improvements are welcome.

Node.js streams

Streams are one of Node.js’ primer, they are part of the core API, offering an way to control flow and to avoid backpressure.

If you aren’t familiar with Node.js streams API, I totally recommend you to check it out, I won’t cover the basics in this blog posts since a lot of people have done that and with a far better performance that any tutorial I could write.. To get inspired, I recommend watching the following two talks, they make an awesome job explaining the amazing powers of streams:

There is also a nodeschool for Streams, named Streams-Adventure and a book.

Breaking down dirbuster architecture

To implement dirbuster in Node.js, we need to be able to: - Parse a file with a list of words - Identify which paths are dirs on the target Web Server - Test every path to find which of them represent a resource in the Web Server - Transform the output in the desired format (txt, json, xml, etc) - Export the output

These requirements can be visualized in the following architecture graph:

┌─────────┐  ┌────────┐  ┌──────────┐  ┌─────────┐  ┌────────┐
│Generator│  │Look for│  │Collectors│  │Convert  │  │Export  │
│streams 1│─>│dirs 2  │─>│         3│─>│results 4│─>│stream 5│
└─────────┘  └────────┘  └──────────┘  └─────────┘  └────────┘
  1. Where the word list comes from, we can use a pre generated word list or a fuzzer to generate intelligent(or not) string.

  2. Check which of the words are actually directories and if they are, recursively test for paths inside that directory.

  3. Collectors are streams that will transform the paths into resource found “true/false” output. This can be done through the various HTTP types of requests (GET, POST, PUT, DELETE, etc)

  4. To give some flexibility and utility to the results, we want to be able to export the results in different forms, from csv, json, xml, etc.

  5. Simple export to a file, STDOUT or another stream used by another program.

Let’s now analyse each of these artefacts and understand how they can be achieved using Node.js Streams.

1 Generator streams

The first thing we need for our dirbuster project to work is a word list, this word list can be generated or read from a source that made the job of generation it for us. So in essence, our box number 1 should look like this:

┌ ─ ─ ─ ─ ─ ─ ┐
││file       ││
 │           │
 ┌───────────┐      stream
││random gen ││   interface
 │(fuzzer)   │ ───────────▶
││other      ││
 │source     │
 ─ ─ ─ ─ ─ ─ ─

This can be achieved using a Readable Stream, the perfect stream when we want to flow data from a resource into our program.

To create a Readable Stream from a file, we can do as follows:

var fs = require('fs');
var wordListStream = fs.createReadStream(<path_to_file>);

However, we can see that to get our word list into our program, there is a little more stream sauce added to it. Check that in the gen-streams/createList.js, we pipe it to a liner stream and a cleaner stream. The liner stream is a Transform Stream that will grab the data generated by the word list stream and make that every word is partitioned in a different data event (breaking it by the new line character). As for the cleaner stream, it is also a Transform stream that will remove any line that is a comment or an empty line, so that we don’t move unnecessary content to our program.

2 Look for directories

To build the dir testing component, we need to build a “stream feedback loop”, this might not look natural at first, but let me illustrate, we need:

  1. Test for every word available, its possible dir variance (e.g for /js, test /js/)
  2. Each time a dir is found (a response with a status code other than 404), let the program know it has to test for every word on that path (e.g /js/a, /js/code, etc) and also test for remaining dirs inside the dir found (recursively).

It is hard to put down in words, so let’s see it in a graph:

               ┌──────────┐  ┌─────────┐
               │Add dir as│  │Generator│
            ┌──│a prefix  │<─│Stream   │<┐
            │  └──────────┘  └─────────┘ │
┌─────────┐ │  ┌─────────┐  if a dir     Λ
│Generator│ │  │Test for │  is found    ╱ ╲
│Stream   │─┴─>│directory│────────────>▕ + ▏
└─────────┘    └─────────┘              ╲ ╱

As we can see on the graph, each time a dir is found, we prefix a Generator Stream with that dir and pipe it to the Test dir stream.

In order for this to work properly, we need to understand a little bit better how the .pipe system works between streams and how we can avoid having streams closing on us too son.

.pipe is a mechanism to attach two streams together, this means that in a scenario that we have a.pipe(b), for every data event that stream a emits, a .write() will be done to stream b. This also offers a backpressure mechanism, if stream b is unable to take any more writes, it calls automatically the a.pause() function, so that a has to wait until the drain event gets emited by stream b. This is a great feature that Node.js offers, however it might backfire if we are not cautious enough, since we can pipe several streams into the same stream, (e.g a.pipe(b);c.pipe(b)) and as soon as one finishes, emitting the end event, the stream that we piped into, will close also, making any of the others streams that still had data to pipe, to fail. Nevertheless, there is a solution to it, by controlling the end of stream ourselves, let’s analyse this with the help of the following graph.

                         ┌──────────┐         ┌─────────┐
                         │Add dir as│         │Generator│
                      ┌──│a prefix  │<────────│Stream   │<───┐
                      │  └──────────┘         └─────────┘    │
                      │.pipe({end: false})                   │
                      │                                      │
┌─────────┐           │         ┌─────────┐    if a dir      Λ
│Generator│           │         │Test for │    is found     ╱ ╲
│Stream   │───────────┴────────>│directory│───────────────>▕ + ▏
└─────────┘ .pipe({end: false}) └─────────┘                 ╲ ╱
                                    .on('drain')             V

The .pipe({end: false}) tells our streams to not emit the end event when it ends, otherwise our Test for directory stream would close as soon as one of them would finish and since we can’t tell for sure how many we will have (each website will have their set of directories), we have to find another strategy to know when to close the stream.

We will use the drain event, this event will fire every moment that the Test for directory stream has no more data to execute, which can mean only one of two things, 1: there are no paths to be checked or 2: we are in the processes of creating a new prefix stream, so we have to wait just a bit more for more paths to be checked. In order to understand this programmatically, we will check each time a drain event fires, if there are more prefix streams available by checking a counter. This counter is incremented each time we find a dir and decremented each time the .on('end') event is fired.

Before jumping into the collectors stream, we only need to add a glue-stream to join both a generator stream and all of the prefix streams created during this process, so that the collectors only have to deal with a normal stream interface.

3 Collectors

Ok, so now we have all the paths we need to test, it is time to make some requests to see what we can find. We might be interested in knowing which requests work with different HTTP verbs, so it is usefull to enable the user to pick the method or methods.

Visually, we should have something like this:

┌──────┐      ┌ ─ ─ ─      ┌─────┐
│path  │       ┌────┐│     │funil│
│stream│───┬──>│HEAD│──┬──>│     │
└──────┘   │   └────┘│ │   └─────┘
           │  │┌────┐  │
           ├──>│GET │├─┤
           │  │└────┘  │
           │   ┌────┐│ │
           ├──>│PUT │──┤
           │   └────┘│ │
           │  │┌────┐  │
           │  │└────┘  │
           └──> ...  ├─┘
               ─ ─ ─ ┘

Similar to the issue we faced on “2”, we need to merge the several collector streams into one, but without having one to close the destination stream too soon. We could use the same strategy, but since this time we know in advance how many streams are created, we can use a module developed by Igor Soarez, specifically for node-dirbuster (Thanks Igor! :D), but that you can use for your projects.

The module is funil, the portuguese word for “funnel”, which is a good analogy for what it accomplishes, we pipe several streams into one, and it will end when all the streams end. This also creates again a single stream interface that we can use next, instead of having to teach our convertors to adapt for several scenarios.

4 Convert/filter results

Results will arrive at in the form of JSON, so in the case of our desired output being JSON, there is little to do, however, we can always pipe a Transform stream to convert it to XML, csv, etc. Compared to the artefacts, this one is quite straight forward, simply go by:

┌──────┐                   ┌──────┐
│funil │       ┌────┐      │out   │
│      │──────>│JSON│─────>│stream│
└──────┘   │   └────┘  │   └──────┘
           ├ ─>│XML │─ ┤
           │   ┌────┐  │
            ─ >│CSV │─
           │   └────┘  │
           ├ ─>│TSV │─ ┤
           └ ─> ...  ─ ┘

5 Export stream

The export stream, the last piece of node-dirbuster, is simply a Writable Stream, where the Convertor Stream can write its results.

Final view of the architecture

Joining all the pieces together, we have:

┌─────────┐                                ┌──────┐    ┌ ─ ─ ─     ┌─────┐                  ┌──────┐
│Generator│   .pipe({end: false})          │path  │     ┌────┐│    │funil│       ┌────┐     │out   │
│Stream   │──────────────┬────────────────>│stream│──┬─>│HEAD│──┬─>│     │─ ─ ──>│JSON│────>│stream│
└─────────┘              │ .pipe({         └──────┘  │  └────┘│ │  └─────┘   │   └────┘  │  └──────┘
                         │   end: false              │ │┌────┐  │                ┌────┐  │
                         │ })                        ├─>│GET │├─┤            ├ ─>│XML │─ ┤
                         │                           │ │└────┘  │                └────┘
                         │                           │  ┌────┐│ │            │   ┌────┐  │
                         │  .on('end')               ├─>│PUT │──┤             ─ >│CSV │─
                   ┌──────────┐      ┌─────────┐     │  └────┘│ │            │   └────┘  │
                   │Add dir as│      │Generator│     │ │┌────┐  │                ┌────┐
                ┌──│a prefix  │<─────│Stream   │<─┐  ├─>│POST│├─┤            ├ ─>│TSV │─ ┤
                │  └──────────┘      └─────────┘  │  │ │└────┘  │                └────┘
                │.pipe({end: false})              │  └─> ...  ├─┘            └ ─> ...  ─ ┘
                │                                 │    │
┌─────────┐     │         ┌─────────┐ if a dir    Λ     ─ ─ ─ ┘
│Generator│     │         │Test for │ is found   ╱ ╲
│Stream   │─────┴────────>│directory│──────────>▕ + ▏
└─────────┘.pipe({        └─────────┘            ╲ ╱
             end: false       .on('drain')        V

"Stream all of the things!"

Final remarks

node-dirbuster impl isn’t fully complete, things like CLI, etc aren’t there, simply because I’m not sure if anyone would have a usage for this project since there is a Java impl, if there is, let me know on twitter or github and I will be happy to continue and finish it. I’m also happy to accept PR.

One of the things I found during the development of node-dirbuster was the bottleneck caused by the assumption of having by default a http connection pool. Node.js uses a smart http connection pool o multiplex requests, avoiding to establish new connections each time it requests something to a server, although this is really smart and saves us a lot of resources/time, for node-dirbuster scenario, were a bunch of requests will be returning HTTP 404 Not Found errors, and where the default behaviour for a web server in this scenario is to kill the connection, what this means is that all the other requests that are being multiplexed in the same connection will fail, this makes us need to reattempt the request, adding overhead and time. The current implementations sets the pool option to false, meaning that a new TCP socket is open for each request, however this is not the ideal best perform scenario. An possible solution would be to use TCP directly and as long as the TCP connection would remain up, we would use that connection, while preparing others in the backgroun, so that we total time cost of opening connections was equal to open one connection only (since all the others would be done in parallel and in the background).

It turns out that this was a bug in the HTTP implementation, now fixed. Source ref:

Big thank you for reading this. Feedback is welcome :). You are awesome!

Back to posts