Why I didn't implement an npm search site

In my last post, I wrote that the search function of npmjs.com is not as good as I would wish and that I had to click through a large number of packages that are apparently unfinished and abandoned.

I said it would be nice to have a better tool: something that filters packages or has a better ranking, so that the high-quality packages appear first in my results. It would greatly reduce the time I spend finding adequate packages, and it would greatly reduce my frustration.

This blog post is about how I tried to implement such a search myself before I found out that npmsearch.com had already done the same thing. They did not do it in exactly the way I had in mind, but well enough for me to stop my efforts for now.

However, I would like to share what I learned on the way, and the tools I came across.

Tools and engines

npmjs.com is backed by CouchDB. I am not a CouchDB expert, but it seems to be the wrong tool for complex searches. You can store JSON and you can create views via JavaScript. A view is like a transformation of the database. It basically iterates over the data and executes a map function for each document. The function can then emit one or more representations of the document along with an indexing key. The emitted documents are stored in a sorted tree (sorted by their key), which allows fast queries. As new documents are created, the view is updated automatically.

Using JavaScript as the language makes creating views very flexible. The problem is that once you have created a view, your options for using it are very limited. You can query a part of the view, bounded by a startkey and an endkey, but as far as I can see, there is no way of sorting the documents other than by the key they were emitted with.
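To make the view mechanism a bit more concrete, here is a small in-memory imitation in plain JavaScript. It is only a sketch and not how CouchDB is implemented: buildView and queryView are names I made up, the sample documents are invented, and a real CouchDB map function has the form function (doc) { emit(key, value); } with emit provided by CouchDB, while here emit is passed in as a second argument. CouchDB also uses its own collation rules for keys; the sketch simply compares strings.

```javascript
// In-memory imitation of a CouchDB view (hypothetical helper, not CouchDB's API).
// Runs the map function once per document and collects the emitted rows.
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  // The rows are kept sorted by key; this is the only order a view offers.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Range query in the spirit of startkey/endkey (both bounds inclusive here).
function queryView(rows, startkey, endkey) {
  return rows.filter((row) => row.key >= startkey && row.key <= endkey);
}

// Invented sample documents, loosely shaped like registry entries.
const docs = [
  { _id: "verb", name: "verb", keywords: ["documentation", "generator"] },
  { _id: "lodash", name: "lodash", keywords: ["util"] },
  { _id: "q", name: "q", keywords: ["promise"] },
];

// Map function: emit one row per keyword, with the keyword as the key.
const byKeyword = buildView(docs, (doc, emit) => {
  for (const keyword of doc.keywords) {
    emit(keyword, doc.name);
  }
});

// All rows whose key lies between "d" and "h".
const hits = queryView(byKeyword, "d", "h");
console.log(hits.map((row) => `${row.key} -> ${row.value}`));
```

The rows come back ordered by keyword and nothing else. Ranking them by some quality measure would need a different view or extra work on the client.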

There might be a way to do that using a good reduce function. The Lucene engine achieves this by iterating over all documents from an index and maintaining a fixed-size heap structure in which the top documents are stored, sorted by their scores. Something similar could certainly be done in CouchDB, but why should I do that if somebody else (e.g. Lucene) has already done it?

Speaking of Lucene (which is a great library for performing full-text searches): ElasticSearch is an open-source search engine based on Lucene. It provides a REST-based interface and seems to be a good and scalable choice when it comes to performing complex searches.
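The fixed-size heap idea can be sketched in a few lines of plain JavaScript. This is neither Lucene's code nor a CouchDB reduce function. The function topK, the score callback and the downloads field are made up for illustration.

```javascript
// Keep only the k best-scoring documents while iterating over all of them once.
// The heap is a min-heap: heap[0] is always the weakest document kept so far.
function topK(docs, k, score) {
  const heap = [];
  const swap = (i, j) => { [heap[i], heap[j]] = [heap[j], heap[i]]; };

  // Move a newly appended entry up until its parent is not larger.
  const siftUp = (i) => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent].score <= heap[i].score) break;
      swap(parent, i);
      i = parent;
    }
  };

  // Move a replaced root down until both children are not smaller.
  const siftDown = (i) => {
    for (;;) {
      const left = 2 * i + 1;
      const right = left + 1;
      let smallest = i;
      if (left < heap.length && heap[left].score < heap[smallest].score) smallest = left;
      if (right < heap.length && heap[right].score < heap[smallest].score) smallest = right;
      if (smallest === i) break;
      swap(i, smallest);
      i = smallest;
    }
  };

  for (const doc of docs) {
    const entry = { doc, score: score(doc) };
    if (heap.length < k) {
      heap.push(entry);
      siftUp(heap.length - 1);
    } else if (k > 0 && entry.score > heap[0].score) {
      // Better than the weakest kept document: replace it.
      heap[0] = entry;
      siftDown(0);
    }
  }

  // Only the k survivors need a final sort, best first.
  return heap.sort((a, b) => b.score - a.score).map((entry) => entry.doc);
}

// Invented sample data; "downloads" is just a stand-in for any score.
const packages = [
  { name: "a", downloads: 10 },
  { name: "b", downloads: 500 },
  { name: "c", downloads: 50 },
  { name: "d", downloads: 200 },
];

const best = topK(packages, 2, (pkg) => pkg.downloads);
console.log(best.map((pkg) => pkg.name)); // [ 'b', 'd' ]
```

The point of the heap is that memory stays bounded by k, no matter how many documents are scanned, and each document costs at most one heap operation.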

So I thought: Why not replicate the whole registry into an ElasticSearch-index and use that to perform the search (e.g. to rank down 0.x.x packages that have not had updates recently)?

Replicating a CouchDB into ElasticSearch

In earlier versions of ElasticSearch there was a mechanism called River. It has been deprecated since version 1.5.0, which is why I didn't take a deeper look. In short: rivers are a mechanism for letting data flow into ElasticSearch, and there were a couple of plugins for different data sources. One of these plugins allowed you to use a CouchDB instance as a data source.

However, the README file of the plugin project on GitHub told me that I should use the logstash couchdb changes input instead.

Logstash

Logstash is a tool written in Ruby and published by Elastic, who also created ElasticSearch. According to Elastic, it was written as a tool to aggregate logfiles and events from a variety of sources and analyze them at a central location.

It can collect data from several sources (such as logfiles or, in my case, a CouchDB instance), transform it and post it into an ElasticSearch index so that it can be searched afterwards.

After a little research, I came up with a simple configuration that copies data from my locally replicated npm registry into an ElasticSearch index. It is mostly copied and modified from examples in the documentation:

input {  
    couchdb_changes {
        db => "npm-registry"
    }
}
output {  
    elasticsearch {
        protocol => "http"
        index => "npm-registry"
    }
}

Running logstash -f config.conf will initiate the transfer. But there is a slight problem: ElasticSearch is not well suited to the format of the documents in the npm registry. It automatically derives a data schema for the documents contained in the index and maps the fields onto Lucene documents. That's why it works best with objects like:

{
    firstname: "Nils",
    lastname: "Knappmeier",
    location: {
        city: "Darmstadt",
        country: "Germany"
    }
}

This object has a well-defined set of keys. A key explicitly defines the meaning of its value. It is clear that every location.city property semantically contains the same kind of data. A document of the npm registry, on the other hand, contains dynamic objects. An example is the dependencies property, which has the form

{
    dependencies: {
        "lodash": "^4.0.0",
        "q": "^1.4.1"
    }
}

The keys of this object don't describe the meaning of their values; they are values themselves. In order to use this object in ElasticSearch in a meaningful way, it has to be converted into something like

{
    dependencies: [
        { name: "lodash", version: "^4.0.0" },
        { name: "q", version: "^1.4.1" }
    ]
}

This could be done in logstash using filter plugins, but none of the existing plugins I looked at was flexible enough to perform such a conversion. So before starting to learn Ruby and implement my own plugin, I looked elsewhere.
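Outside of logstash, the conversion itself is small. The following plain JavaScript sketch shows it, together with a rough view of the field names an automatically derived schema would end up with before and after. The helpers toEntries and fieldPaths are hypothetical names of my own, and fieldPaths only approximates how ElasticSearch derives its mapping.

```javascript
// Turn { lodash: "^4.0.0" } into [{ name: "lodash", version: "^4.0.0" }].
function toEntries(dependencies = {}) {
  return Object.entries(dependencies).map(([name, version]) => ({ name, version }));
}

// List the dotted field paths of a document, roughly the way a
// dynamically derived schema would see them (simplified).
function fieldPaths(value, prefix = "") {
  if (Array.isArray(value)) {
    // All array items share the same field path; drop duplicates.
    return [...new Set(value.flatMap((item) => fieldPaths(item, prefix)))];
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value).flatMap(([key, child]) =>
      fieldPaths(child, prefix ? `${prefix}.${key}` : key)
    );
  }
  return [prefix];
}

const original = { dependencies: { lodash: "^4.0.0", q: "^1.4.1" } };
const converted = { dependencies: toEntries(original.dependencies) };

console.log(fieldPaths(original));  // [ 'dependencies.lodash', 'dependencies.q' ]
console.log(fieldPaths(converted)); // [ 'dependencies.name', 'dependencies.version' ]
```

In the original form, every new package name becomes a new field. In the converted form there are always exactly two fields, whatever the dependencies are. As far as I know, ElasticSearch flattens arrays of objects by default, so keeping each name paired with its version in a query would probably also require a nested mapping.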

npmsearch.com

When I looked for methods of transforming the CouchDB documents for ElasticSearch, I came across a blog post by Diana Thayer on orchestrate.io, which is about basically the same thing. Among other things, she writes about the transformation problem and points to npm-normalize and to http://npmsearch.com. I haven't tried npm-normalize myself yet, so I cannot tell how close it would come to solving my own problems. However, the GitHub account solids, which contains npm-normalize, also has a package npm2es which, according to its README, does exactly what I wanted to do: replicate the npm registry into an ElasticSearch index and update the index when changes to the registry are published in the _changes stream. Well, that's great, isn't it? Problem solved.

But... I really wanted to set up this site, and now I've learned that it already exists...

Somebody has already done it!

Have you ever had the feeling that, no matter what you want to do, somebody has already done it? You have a great idea, you play around, you start implementing things and learn about new technologies. And then you find out that somebody had the same problem two years ago and has already implemented a solution. It's a bit frustrating, but considering the total number of developers on our planet, it's not a big surprise.

A short evaluation

I went to http://npmsearch.com and entered "documentation generator" into the search field.

A couple of months ago, I used this term, among a lot of others, to look for a tool that generates Markdown documentation for my projects. I had to look through a lot of results until I finally found the very promising tool verb. This tool ranks 14th in my search results on http://npmsearch.com.

The source code of the project is available on GitHub.

What's left to do?

For now, this project of mine seems to be at a dead end. I might start playing around with npm2es and implement a different search site with different metrics. But I won't do that right now, because http://npmsearch.com seems to be pretty good. I might simply use it and see if it's really that good. There are other interesting things to spend time on.

If you are annoyed with the search on npmjs.com, I'd recommend you give npmsearch.com a try.