Wednesday, September 17, 2014

doxli - a help utility for node modules on command line

Quite often I fire up the node REPL and pull in some modules I've written to use on the command line. Unfortunately I often forget the exact way to call the various functions in those modules (there are a lot) and end up doing something like foo.dosomething.toString() to see the source code and recall the function signature.

In the interest of making code as "self-documenting" as possible, I wrote a small utility that uses dox to provide help for modules on the command line. It adds a help() function to a module's exported methods so you can get the dox / jsdoc comments for the function on the command line.

So now foo.dosomething.help() will return the description, parameters, examples and so on for the method based on the documentation in the comments.
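
To make the idea concrete, here is a rough sketch of how a help() method could be attached to exported functions. It is not doxli's actual code (doxli uses dox to parse the comments); this version uses a plain regular expression, and the names attachHelp and exampleSource are made up for illustration. It assumes functions are exported as `exports.<name> = ...` with a `/** ... */` comment directly above.

```javascript
// Hypothetical sketch, not doxli's code: give each exported function a help()
// method that returns the /** ... */ comment written above it.
function attachHelp(moduleExports, source) {
  Object.keys(moduleExports).forEach(function (name) {
    var fn = moduleExports[name];
    if (typeof fn !== 'function') return;

    // Find a doc comment directly followed by "exports.<name> =".
    // Assumes plain identifier names (no regex special characters).
    var pattern = new RegExp(
      '/\\*\\*((?:(?!\\*/)[\\s\\S])*)\\*/\\s*exports\\.' + name + '\\s*='
    );
    var match = pattern.exec(source);

    var text = 'No documentation found for ' + name;
    if (match) {
      // Strip the leading " * " from each comment line.
      text = match[1]
        .split('\n')
        .map(function (line) { return line.replace(/^\s*\*?\s?/, ''); })
        .join('\n')
        .trim();
    }
    fn.help = function () { return text; };
  });
  return moduleExports;
}

// Example: a tiny "module" and its source text.
var exampleSource = [
  '/**',
  ' * Add two numbers.',
  ' * @param {number} a',
  ' * @param {number} b',
  ' */',
  'exports.add = function (a, b) { return a + b; };'
].join('\n');

var example = attachHelp(
  { add: function (a, b) { return a + b; } },
  exampleSource
);
console.log(example.add.help());
```

In the REPL you would call example.add.help() the same way you would otherwise have called example.add.toString().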

It's still a bit of a work in progress, but it works nicely - provided you actually document your modules with jsdoc-style comments.

All the info is here: https://www.npmjs.org/package/doxli

Sunday, September 7, 2014

REST API Best Practices 4: Collections, Resources and Identifiers

Other articles in this series:
  1. REST API Best Practices: A REST Cheat Sheet
  2. REST API Best Practices: HTTP and CRUD
  3. REST API Best Practices: Partial Updates - PATCH vs. PUT
RESTful APIs center around resources that are grouped into collections. A classic example is browsing through the directory listings and files on a website like http://vault.centos.org/. When you browse the directory listing, you can click through a series of folders to download files.  The folders are collections of CentOS resource files.



In REST, collections and resources are accessed via HTTP URIs in a similar way:

members/ -- a collection of members
members/1 -- a resource representing member #1
members/2 -- a resource representing member #2

It may help to think of a REST collection as a directory folder containing files, although it's highly unlikely that the member data is stored as literal JSON files on the server. The member data will usually come from a database, but from the perspective of a REST API, it looks similar to a directory called "members" that contains a bunch of files for download.

Naming collections


In case it's not obvious already, collection names should be nouns. Use the plural form for naming collections. There's been some debate over whether collection names should be plural (members/1) or singular (member/1). The plural form seems to be most widely used.

Getting a collection


Getting a collection, like "members", may return:
  1. the entire list of resources as a list of links, 
  2. partial representations of each resource, or 
  3. full representations of all the resources in the collection. 
Our classic example of browsing online directories and files uses approach #1, returning a list of links to the files. The list is formatted in HTML, so you can click on the hyperlink to access a particular file.

Approach #2, returning a partial representation (e.g. first name, last name) of every resource in a collection, is the more pragmatic option: it gives the end user enough information to pick a resource and request further details, which matters most when the collection can contain a lot of resources. In fact, the directory listings on a website like http://vault.centos.org/ display more than just the hyperlink. They include additional metadata like the last-modified timestamp and file size as well. This is helpful for the end user who's looking for an up-to-date file and wants to know how long it will take to download. It's a good example of returning just enough information about the resources for the end user to be able to make a selection.

With approach #3, if a collection is small, you may want to return the full representation of all the resources in the collection as a big array. For large collections, however, it isn't practical. GitHub is the only RESTful API example I've seen that actually returns a full representation of all resources when you fetch the collection. I wouldn't consider #3 to be a "best practice", or recommend it for most use cases, but if you know the collection and resources will be small, it might be more effective to fetch the whole collection all at once like this.

The best practice for fetching a collection of resources, in my opinion, is #2: return a partial representation of the resources in a collection with just enough information to facilitate the selection process, and be sure to include the URL (href) of each resource where it can be downloaded from.

Only when a collection is guaranteed to be small and you need to reduce the performance impact of making multiple queries, consider bending the rules with approach #3 to return all the resources in one fell swoop.

Here's a practical example of fetching the collection of members using approach #2.

Request

GET /members
Host: localhost:8080

Response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

[
  {
    "id": 1,
    "href": "/members/1",
    "firstname": "john",
    "lastname": "doe"
  },
  {
    "id": 2,
    "href": "/members/2",
    "firstname": "jane",
    "lastname": "doe"
  }
]

In this example, some minimal information is returned about each of the members: first and last name, id, and the "href" URL where the full representation of the member resource can be downloaded.
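
As a rough sketch of how a server might build that listing, the function below maps full records to the partial representation shown above. The in-memory members array and the name toCollectionListing are assumptions for illustration; a real API would read from a database.

```javascript
// Hypothetical in-memory data; a real API would load this from a database.
var members = [
  { id: 1, firstname: 'john', lastname: 'doe', active: true, foo: 'bar' },
  { id: 2, firstname: 'jane', lastname: 'doe', active: false, foo: 'baz' }
];

// Build the partial representation returned by GET /members:
// a few identifying fields plus the href of the full resource.
function toCollectionListing(collectionName, resources) {
  return resources.map(function (resource) {
    return {
      id: resource.id,
      href: '/' + collectionName + '/' + resource.id,
      firstname: resource.firstname,
      lastname: resource.lastname
    };
  });
}

console.log(JSON.stringify(toCollectionListing('members', members), null, 2));
```

Picking the listed fields explicitly, rather than deleting the unwanted ones, means that new attributes added to a member later don't leak into the listing by accident.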


Getting a resource


Getting a specific resource should return the full representation of that resource from the URL that contains the collection name and the ID of the specific resource you want.

Resource IDs


RESTful resources have one or more identifiers: a numerical ID, a title, and so on. Common practice is for every resource to have a numeric ID that is used to reference the resource, although there are some notable exceptions to the rule.

Resources themselves should contain their numerical ID; the current best practice is for this to exist within the resource simply as an attribute labelled "id". Every resource should contain an "id"; avoid using more complicated names for resource identifiers like "memberID" or "accountNumber" and just stick with "id". If you need additional identifiers on a resource, go ahead and add them, but always have an "id" that acts as the primary way to retrieve the resource. So, if a member has "id" : 1, it should be fairly obvious that you can fetch his details at the URL "members/1".

An example of fetching a member resource would be:

Request

GET /members/1
Host: localhost:8080

Response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "id": 1,
  "href": "/members/1",
  "firstname": "john",
  "lastname": "doe",
  "active": true,
  "lastLoggedIn": "Tue Sep 16 2014 08:37:42 GMT-0400 (EDT)",
  "foo": "bar",
  "fizz": "buzz",
  "qux": "doo"
}
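
A minimal sketch of the lookup behind such a request might look like the following. The collections object and the getResource function are hypothetical names, and the data lives in memory purely for illustration.

```javascript
// Hypothetical in-memory collections keyed by collection name.
var collections = {
  members: [
    { id: 1, href: '/members/1', firstname: 'john', lastname: 'doe', active: true },
    { id: 2, href: '/members/2', firstname: 'jane', lastname: 'doe', active: true }
  ]
};

// Resolve a path such as "/members/1" to { status, body }.
function getResource(store, path) {
  var parts = path.split('/').filter(Boolean); // e.g. ["members", "1"]
  var known = Object.prototype.hasOwnProperty.call(store, parts[0]);
  if (!known || parts.length !== 2) {
    return { status: 404, body: null };
  }
  var id = Number(parts[1]);
  var found = store[parts[0]].filter(function (r) { return r.id === id; })[0];
  return found ? { status: 200, body: found } : { status: 404, body: null };
}

console.log(getResource(collections, '/members/1'));
```

Because every resource uses the same "id" attribute, the same lookup works for any collection without per-collection code.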

Beyond simple collections


Most of the examples you see online are fairly simple, but practical data models are often much more complex. Resources frequently contain sub-collections and relationships with other resources. API design in this area seems to be done in a mostly ad-hoc manner, but there are some practical considerations and trade-offs when designing APIs for more complex data models, which should be covered in the next post.

Thursday, August 21, 2014

Defensive Shift - Turning the Tables on Surveillance

Like many people lately, I've been pondering the implications of pervasive surveillance, "big data" analysis, state-sponsored security exploits, and the role of technology in government. For one thing, my work involves a lot of the same technology: deep packet inspection, data analysis, machine learning and even writing experimental malware. However, instead of building tools that enable pervasive government surveillance, I've built a product that tells mobile smartphone users if their device, or a laptop connected to it, has been infected with malware, been commandeered into a botnet, or come under attack from a malicious website, and so on.  I'm happy to be working on applying some of this technology in a way that actually benefits regular people. It feels much more on the "good side" of technology than on the bad side we've been hearing so much about lately.

Surveillance of course has been in the news a lot lately, so we're all familiar with the massive betrayal of democratic principles by governments, under the guise of hunting the bogeyman. It's good that people are having conversations about reforming it, but don't expect the Titanic to turn around suddenly. There's far too much money and too many careers on the line to just shut down the leviathan of pervasive surveillance overnight. It will take time, and a new generation of more secure networking technologies.

Big data has also been in the news in some interesting ways: big data analysis has been changing the way baseball is played! CBC's David Common presents the story [1]:

http://www.cbc.ca/news/world/how-the-defensive-shift-and-big-data-are-changing-baseball-1.2739619

Not everyone is happy with the "defensive shift" - the process of repositioning outfield players based on batting stats that tell coaches how likely a batter is to hit left or right, short or long.  Longtime fans feel it takes away from the human element of the game and turns it into more of a science experiment.

I tend to agree.  And to be honest, until now deep traffic inspection, big data analysis, surveillance, and definitely state-sponsored hacking, have quite justifiably earned a reputation as, well, repugnant to any freedom-loving, democracy-living, brain-having person. Nevertheless, as powerful as big data analytics, machine learning, and network traffic analysis are, and as much as they have been woefully abused by our own governments, I don't think we've yet begun to see the potential for good that these technologies could have, particularly if they are applied in reverse to the way they're being used now.

Right now we're in a position where a few privileged, state-sponsored bad actors are abusing their position of trust and authority to turn the lens of surveillance and data analysis upon ordinary people, foreign business competitors [2], jilted lovers [3], etc.  The sea change that will, I think, eventually come is when the lens of technology slowly turns with relentless inevitability onto the government itself, and we have the people observing and monitoring and analyzing the effectiveness of our elected officials and public servants and their organizations.

How do we begin to turn the tables on surveillance?

Secure Protocols

As I see it, this "defensive shift" will happen due to several factors. First, the best and brightest engineers - the ones who design the inner workings of the Internet and write the open-source software used for secure computing - are on the whole smart enough to know that pervasive surveillance is an attack and a design flaw [4]. They are calling for it to be fixed in future versions of Internet protocols [5], and are already working on fixing some of the known exploits [6].

One of the simplest remedial actions available right now for pervasive surveillance attacks is HTTPS, with initiatives like HTTPS Now [9] showing which web sites follow good security practices, and tools like HTTPS Everywhere [10], a plugin for your web browser that helps you connect to websites securely. There is still work to be done in this area, as man-in-the-middle attacks and compromised cryptographic keys are widespread at this point - a problem for which perfect forward secrecy [11] needs to become ubiquitous. We should expect future generations of networking protocols to be based on these security best practices.

Some people say that creating a system that is totally secure against all kinds of surveillance, including lawful intercept, will only give bad people more opportunity to plan and carry out their dirty deeds.  But this turns out not to be true when you look at the actual data of how much information has been collected, how much it all costs, and how effective it's actually been.  It yields practically nothing useful and is almost always a "close the barn door, the horse is out!" scenario. This, coming from an engineer who actually works in the area of network-based threat analysis, by the way.

Open Data

Second, the open data movement. It's not just you and I who are producing data-trails as we mobe and surf and twit around the Interwebs.  There's a lot of data locked up in government systems, too.  If you live in a democracy, who owns that data? We do. It's ours. More and more of it is being made available online, in formats that can be used for computerized data analysis.  Sites like the Center for Responsive Politics' Open Secrets Database [8], for example, shed a light on money in politics, showing who's lobbying for what, how much money they're giving, and who's accepting the bribes, er, donations.

One nascent experiment in the area of government open data analysis is AnalyzeThe.US [7], a site that lets you play with a variety of public data sources to see correlations. Warning - it's possible for anyone to "prove" just about anything with enough graphs and hand-waving. For real meaningful analysis, having some background in mathematics and statistics is a definite plus, but the tool is still super fun and provides a glimpse of where things could be going in the future with open government.

Automation

Third, automation. There's still a long way to go in this area, but even the slowness and inefficiency of government will eventually give way to the relentless march of technology as more and more systems that have traditionally been mired in bureaucratic red tape become networked and automated, all producing data for analytics. Filling in paper forms for hours on end will eventually be as absurd for the government to require as it would be for buying a book from Amazon.

With further automation and data access, the ability to monitor, analyze and even take remedial action on bureaucratic inefficiencies should be in the hands of ordinary people, turning the current model of Big Brother surveillance on its head. Algorithms will be able to measure the effectiveness of our public services and national infrastructures, do statistical analysis, provide deep insight and make recommendations. The business of running a government, which today seems to be a mix of guesswork, political ideology and public relations management, will start to become less of a religion and more of a science, backed up with real data. It won't be a technocracy - but it will be leveraging technology to effectively crowd-source government.  Which is what democracy is all about, after all.


[1] http://www.cbc.ca/news/world/how-the-defensive-shift-and-big-data-are-changing-baseball-1.2739619
[2] http://www.cbc.ca/news/politics/why-would-canada-spy-on-brazil-mining-and-energy-officials-1.1931465
[3] http://www.cnn.com/2013/09/27/politics/nsa-snooping/
[4] http://tools.ietf.org/html/rfc7258
[5] http://techcrunch.com/2013/10/11/icann-w3c-call-for-end-of-us-internet-ascendancy-following-nsa-revelations/
[6] https://www.fsf.org/blogs/community/gnu-hackers-discover-hacienda-government-surveillance-and-give-us-a-way-to-fight-back
[7] AnalyzeThe.US
[8] https://www.opensecrets.org/
[9] https://www.httpsnow.org/
[10] https://www.eff.org/https-everywhere
[11] http://en.wikipedia.org/wiki/Forward_secrecy#Perfect_forward_secrecy

Thursday, August 14, 2014

Repackaging node modules for local install with npm


If you need to install an npm package for nodejs from local files, because you can't or prefer not to download everything from the npmjs.org repo, or you don't even have a network connection, then you can't just get an npm package tarball and do `npm install <tarball>`, because it will immediately try to download all its dependencies from the repo.

There are some existing tools and resources you can try:

  • npmbox - https://github.com/arei/npmbox
  • https://github.com/mikefrey/node-pac
  • bundle.js gist -  https://gist.github.com/jackgill/7687308
  • relevant npm issue - https://github.com/npm/npm/issues/4210

I found all of these a bit over-wrought for my taste. So if you prefer a simple DIY approach, you can simply edit the module's package.json file, and copy all of its dependencies over to the "bundledDependencies" array, and then run npm pack to build a new tarball that includes all the dependencies bundled inside.

Using `forever` as an example:
  1. make a directory and run `npm init; npm install forever` inside of it
  2. cd into the node_modules/forever directory
  3. edit the package.json file
  4. look for the dependencies property
  5. add a bundledDependencies property that's an array
  6. copy the names of all the dependency modules into the bundledDependencies array
  7. save the package.json file
  8. now run `npm pack`. It will produce a forever-<version>.tgz file that has all its dependencies bundled in.
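
Steps 3 to 7 can also be scripted. The following is only a sketch of that edit: the function name bundleAllDependencies and the example package contents are made up, and it works on an already-parsed package.json object rather than touching the file system.

```javascript
// Return a copy of a package.json object with every dependency
// also listed in bundledDependencies.
function bundleAllDependencies(pkg) {
  var copy = JSON.parse(JSON.stringify(pkg));
  copy.bundledDependencies = Object.keys(copy.dependencies || {});
  return copy;
}

// Made-up package.json content for illustration only.
var examplePkg = {
  name: 'example-module',
  version: '1.0.0',
  dependencies: { 'dep-a': '~1.2.0', 'dep-b': '0.3.x' }
};

console.log(bundleAllDependencies(examplePkg).bundledDependencies);
// To apply it for real, read package.json with fs.readFileSync, pass the
// parsed object through bundleAllDependencies, write it back with
// fs.writeFileSync, then run `npm pack`.
```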



Thursday, May 29, 2014

JavaScript's Final Frontier - MIDI

JavaScript has had an amazing last few years. Node.JS has taken server-side development by storm. First person shooter games are being built using HTML and JavaScript in the browser. Natural language processing and machine learning are being implemented in minimalist JavaScript libraries. It would seem like there's no area in which JavaScript isn't set to blow away preconceptions about what it can't do and become a major player.

There is, however, one area in which JavaScript - or more accurately the web stack and the engines that implement it - has only made a few tentative forays.  For me this represents a final frontier; the one area where JavaScript has yet to show that it can compete with native applications. That frontier is MIDI.

I know what you're probably thinking. Cheesy video game soundtracks on your SoundBlaster sound card. Web pages with blink tags and bad music tracks on autoplay. They represent one use case where MIDI was applied outside of its original intent. MIDI was made for connecting electronic musical instruments, and it is still very much alive and well. From lighting control systems to professional recording studios to GarageBand, MIDI is a key component of arts performance and production. MIDI connects sequencers, hardware, software synthesizers and drum machines to create the music many people listen to every day. The specification, though aging, shows no signs of going away anytime soon. It's simple and effective and well crafted.

It had to be. Of all applications, music could be the most demanding. That's because in most applications, even realtime ones, the exact timing of event processing is flexible within certain limits. Interactive web applications can tolerate latency on their network connections. 3D video games can scale down their frames per second and still provide a decent user experience. At 30 frames per second, the illusion of continuous motion is approximated. The human ear, on the other hand, is capable of detecting delays as small as 6 milliseconds. For a musician, latency of 20ms between striking a key and hearing a sound would be a show-stopper. Accurate timing is essential for music performance and production.

There's been a lot of interest and some amazing demos of Web Audio API functionality.  The Web MIDI API, on the other hand, hasn't gotten much support.  Support for Web MIDI has landed in Chrome Canary, but that's it for now.  A few people have begun to look at the possibility of adding support for it in Firefox.  Until the Web MIDI API is widely supported, interested people will have to make do with the JazzSoft MIDI plugin and Chris Wilson's Web MIDI API shim.

I remain hopeful that support for this API will grow, because it will open up doors for some truly great new creative and artistic initiatives.
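
To give a feel for how simple the protocol is, here is a small sketch that decodes a MIDI note message from its three raw bytes, the kind of byte array a Web MIDI onmidimessage handler receives in event.data. The function name parseMidiMessage and the shape of the returned object are my own choices for illustration, and only Note On / Note Off are handled.

```javascript
// Decode a 3-byte MIDI channel message (for example the event.data
// array delivered to a Web MIDI onmidimessage handler).
function parseMidiMessage(data) {
  var status = data[0] & 0xf0;        // upper nibble: message type
  var channel = (data[0] & 0x0f) + 1; // lower nibble: channel 1-16
  var note = data[1];
  var velocity = data[2];

  if (status === 0x90 && velocity > 0) {
    return { type: 'noteon', channel: channel, note: note, velocity: velocity };
  }
  // Note On with velocity 0 is conventionally treated as Note Off.
  if (status === 0x80 || status === 0x90) {
    return { type: 'noteoff', channel: channel, note: note, velocity: velocity };
  }
  return { type: 'other', channel: channel, data: [data[1], data[2]] };
}

console.log(parseMidiMessage([0x90, 60, 100])); // middle C pressed on channel 1
```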

Wednesday, May 7, 2014

REST API Best Practices 3: Partial Updates - PATCH vs PUT

This post is a continuation of REST API Best Practices 2: HTTP and CRUD, and deals with the question of partial updates.

REST purists insist that PATCH is the only "correct" way to perform partial updates [1], but it hasn't reached "best-practice" status just yet, for a number of reasons.

Pragmatists, on the other hand, are concerned with building mobile back-ends and APIs that simply work and are easy to use, even if that means using PUT to perform partial updates [2].

The problems with using PATCH for partial updates are manifold:
  1. Support for PATCH in browsers, servers and web application frameworks is not universal. IE8, PHP, Tomcat, django, and lots of other software have missing or flaky support for it. So depending on your technology stack and users, it might not even be a valid option for you.
  2. Using the PATCH method correctly requires clients to submit a document describing the differences between the new and original documents, like a diff file, rather than a straightforward list of modified properties. This means the client has to do a lot of extra work - keep a copy of the original resource, compare it to the modified resource, create a "diff" between the two, compose some type of document showing the differences, and send it to the server. The server also has more work to apply the diff file. 
  3. There's no specification that says how the changes in the diff file should be formatted or what it should contain, exactly. The RFC simply says:
    "With PATCH, however, the enclosed entity contains a set of instructions describing how a resource currently residing on the origin server should be modified to produce a new version."
    There are some interesting recommendations emerging like JSON Patch [3] (since published as RFC 6902), but at this point it seems mainly up to each developer to figure out their own way of using PATCH.
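
To illustrate the extra client-side work described in point 2, here is a sketch that compares an original and a modified resource and emits the changes in the operation format used by JSON Patch [3]. It is a simplification under stated assumptions: it only handles flat objects with primitive values, and the names diffToPatch and escapePointer are made up for this example.

```javascript
// Escape a property name for use in a JSON Pointer path (RFC 6901).
function escapePointer(key) {
  return key.replace(/~/g, '~0').replace(/\//g, '~1');
}

// Compare two flat objects and list the changes as JSON Patch-style operations.
function diffToPatch(original, modified) {
  var ops = [];
  Object.keys(modified).forEach(function (key) {
    var path = '/' + escapePointer(key);
    if (!Object.prototype.hasOwnProperty.call(original, key)) {
      ops.push({ op: 'add', path: path, value: modified[key] });
    } else if (original[key] !== modified[key]) {
      ops.push({ op: 'replace', path: path, value: modified[key] });
    }
  });
  Object.keys(original).forEach(function (key) {
    if (!Object.prototype.hasOwnProperty.call(modified, key)) {
      ops.push({ op: 'remove', path: '/' + escapePointer(key) });
    }
  });
  return ops;
}

var before = { firstname: 'john', lastname: 'doe', foo: 'bar' };
var after = { firstname: 'johnny', lastname: 'doe', fizz: 'buzz' };
console.log(JSON.stringify(diffToPatch(before, after)));
```

Even in this simplified form, the client has to keep the original copy around and walk both objects, which is the overhead the pragmatists object to.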
Using PUT for partial updates, however, is pretty simple, even if it doesn't conform strictly to the concept of Representational State Transfer.  So a fair number of programmers happily use it to implement partial updates on back-end mobile API servers. It's fair to say that when developing an API, a pragmatic approach that focuses on the needs of mobile client applications is completely reasonable.

So what are the current "best practices" when using PUT for partial updates, for those who choose practicality over purity? As I see it, basically this: When you PUT the update, include the properties you want to update, leave out the properties you don't want to update, and for any properties you want to delete, set them to null.

 

 Pragmatic partial updates with PUT

  1. Include properties to be updated
  2. Don't include properties not to be updated
  3. Set properties to be 'deleted' to null
The reality is that most data is going to be stored in a database that, even if it's a NoSQL database, has an implicit or explicit schema that describes what sort of data your application is expecting. If you're using a relational database, this will end up being columns in your database tables, some of whose values may be null. In this scenario it makes perfect sense to "delete" properties by setting them to null, since the database columns are not going to disappear in any case. And for those who use a document database, it's not a stretch to delete nullified properties.
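
The three rules can be sketched in a few lines. This is an illustration of the convention under my own assumptions, not code from any particular framework; applyPartialUpdate is a hypothetical name, and it follows the document-database variant where nullified properties are removed.

```javascript
// Apply a partial update sent with PUT:
// - properties present in the update are overwritten
// - properties absent from the update are left alone
// - properties set to null are deleted
function applyPartialUpdate(resource, update) {
  var result = {};
  Object.keys(resource).forEach(function (key) { result[key] = resource[key]; });
  Object.keys(update).forEach(function (key) {
    if (update[key] === null) {
      delete result[key];
    } else {
      result[key] = update[key];
    }
  });
  return result;
}

var member = { id: 1, firstname: 'john', lastname: 'doe', foo: 'bar' };
var updated = applyPartialUpdate(member, { firstname: 'johnny', foo: null });
console.log(updated);
```

With a relational database you would store the null in the column instead of deleting the key, but the request format stays the same.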

Further reading

1. http://williamdurand.fr/2014/02/14/please-do-not-patch-like-an-idiot/
2. http://techblog.appnexus.com/2012/on-restful-api-standards-just-be-cool-11-rules-for-practical-api-development-part-1-of-2/
3. http://tools.ietf.org/html/draft-ietf-appsawg-json-patch-07

Monday, April 7, 2014

REST API Best Practices 2: HTTP and CRUD

This post expands a bit further on the REST API Cheat Sheet regarding HTTP operations for Create / Read / Update / Delete functionality in REST APIs.

APIs for data access and management are typically concerned with four actions (the so-called CRUD operations):
  • Create - the ability to create a resource
  • Read - the ability to retrieve a resource
  • Update - the ability to modify a resource
  • Delete - the ability to remove a resource

CRUD operations don't have a perfect, 1-to-1 mapping to HTTP methods, which has led to different opinions and implementations, but the following list represents best practice as I see it in the industry today, and follows the HTTP specification:

CRUD Operation    HTTP Method
Create            POST
Read              GET
Update            PUT and/or PATCH
Delete            DELETE

To reiterate, HTTP methods can be used to implement CRUD operations as follows:
  • POST - create a resource
  • GET - retrieve a resource
  • PUT - update a resource (by replacing it with a new version)*
  • PATCH - update part of a resource (if available and appropriate)*
  • DELETE - remove a resource

Although PATCH is considered the officially correct and "RESTful" way to do partial updates, it has yet to gain wide adoption. Many popular web application frameworks don't support the PATCH method yet, so in practice, it is not uncommon to use PUT for partial updates even though it's not strictly "RESTful". The decision to use PUT vs. PATCH for partial updates is driven by the capabilities of your framework of choice (Rails only recently introduced PATCH, for example) and by the practical requirements of building web/mobile back-end services that actually work and are easy to use, even if they don't satisfy REST purists. More on this in the next post.

 

Safe and Idempotent Methods

 

The HTTP 1.1 specification defines "safe" and "idempotent" methods [1].  Safe methods don't modify data on the server no matter how many times you call them. Idempotent methods can modify data on the server the first time you call them, but repeating the same call over and over again won't make any difference. Here's a partial list:

Method    Safe    Idempotent
GET       yes     yes
HEAD      yes     yes
PUT       no      yes
PATCH     no      no (not guaranteed; see RFC 5789)
DELETE    no      yes
POST      no      no

The safe and/or idempotent nature of these HTTP methods provides some further insight into how they ought to be used. Notice that POST is neither safe, nor idempotent. A successful POST should create new data on the server, and repeating the same call should create even more copies on the server. GET, on the other hand, is safe and idempotent, so no matter how many times you call it, the data on the server shouldn't be affected.
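
The difference is easy to see with a toy in-memory store. This is purely an illustration of the definitions; createStore and its method names are hypothetical and do not correspond to any real framework.

```javascript
// Hypothetical in-memory store used only to illustrate the definitions.
function createStore() {
  var items = {};
  var nextId = 1;
  return {
    // GET: safe, never changes state.
    get: function (id) { return items[id] || null; },
    // PUT: idempotent, replaces the resource at a known id.
    put: function (id, data) { items[id] = data; return data; },
    // POST: neither safe nor idempotent, each call creates a new resource.
    post: function (data) { var id = nextId++; items[id] = data; return id; },
    // DELETE: idempotent, deleting twice leaves the same state as once.
    remove: function (id) { delete items[id]; },
    count: function () { return Object.keys(items).length; }
  };
}

var store = createStore();
store.post({ firstname: 'john' });
store.post({ firstname: 'john' });     // second POST creates a second copy
store.put(1, { firstname: 'johnny' });
store.put(1, { firstname: 'johnny' }); // repeating the PUT changes nothing more
store.remove(2);
store.remove(2);                       // repeating the DELETE changes nothing more
console.log(store.count(), store.get(1));
```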

GET - use it to fetch resources, but don't "tunnel" request parameters through to the server as a way to alter the state of data on the server - as a "safe" method, calling GET shouldn't have side effects.

PUT - use it to update an existing resource by replacing it with a new representation. The data you PUT to the server should be a complete replacement for the specified resource. Although PUT can in theory be used to insert new resources, in practice it's not advisable. Note that after the first PUT request, repeatedly calling the same PUT method with the same data won't change the data on the server more than it already has been (a condition of idempotent methods).

PATCH - if this method is available and well supported in both your client and server side technology stack (e.g. Rails 4), consider using it to update part of an existing resource by changing some of its properties, following the recommendations of the framework for how to submit the change descriptions. The PATCH method isn't supported everywhere and not common enough to be considered a current best practice, but the industry seems to be moving this way and technically it's the correct way to provide partial updates according to the HTTP spec [2].

If your server, framework or client user base (IE8, etc) doesn't support PATCH, rest assured that many developers take the pragmatic approach and simply bend the rules to use PUT for partial updates [3]. I'll cover this in the next post in more detail. Note that, no matter how you do your partial update, it should be atomic: once the update has started, it should not be possible to retrieve a partially updated copy of the resource. Readers should see either the old version or the fully updated one.

POST - use it to create new resources. The server should create a unique identifier for each newly created resource. Return a 201 Created response if the request was successful. Best practice appears to be to return the unique ID in the response. POST is also frequently used to trigger actions on the server which technically aren't part of a RESTful API, but provide useful functionality for web applications.
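
As a sketch of what a POST handler's result might look like, the function below assigns the id on the server and returns a 201 with the new resource. The names createMember, memberList and nextMemberId are assumptions for illustration, the data is kept in memory, and returning a Location header alongside the id is a common HTTP convention rather than something this post prescribes.

```javascript
// Hypothetical handler for POST /members backed by an in-memory array.
var memberList = [];
var nextMemberId = 1;

function createMember(data) {
  var member = { id: nextMemberId++ };
  Object.keys(data).forEach(function (key) {
    if (key !== 'id') member[key] = data[key]; // the server owns the id
  });
  member.href = '/members/' + member.id;
  memberList.push(member);
  return {
    status: 201, // Created
    headers: { Location: member.href },
    body: member
  };
}

var response = createMember({ firstname: 'john', lastname: 'doe' });
console.log(response.status, response.headers.Location, response.body.id);
```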

DELETE - use it to delete resources; it's pretty self-explanatory.

More posts in this series

REST API Best Practices 1: A REST Cheat Sheet
REST API Best Practices 3: Partial Updates - PATCH vs. PUT
REST API Best Practices 4: Collections, Resources and Identifiers


[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
[2] http://stackoverflow.com/questions/19732423/why-isnt-http-put-allowed-to-do-partial-updates-in-a-rest-api
[3] http://techblog.appnexus.com/2012/on-restful-api-standards-just-be-cool-11-rules-for-practical-api-development-part-1-of-2/