Modeling Aggregated Datasets in JavaScript

Let's do a little thought experiment. Picture a bowl of M&Ms. Any good candy aficionado knows that these little button-shaped candies come in six distinct colors: brown, yellow, green, red, orange, and blue. Given a single bowl of M&Ms, however, you could never know the distribution of the colors; it would be impossible to work out without sorting the candies and manually counting each color. Now picture six smaller bowls, each filled with a single color of M&Ms. This gives you a few more options. Instead of tallying up individual candies, you could determine the weight of a single candy, the weight of an empty bowl, and the weight of each full bowl, and then extrapolate the quantity of each color.
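That extrapolation is just a little arithmetic. Here is a quick sketch in JavaScript (all of the weights are made up for illustration):

// Hypothetical weights, in grams
var candyWeight = 0.9;        // one M&M
var emptyBowlWeight = 150;    // the bowl by itself
var fullBowlWeight = 204;     // the bowl plus one color of candy

// How many candies of that color are in the bowl
var count = (fullBowlWeight - emptyBowlWeight) / candyWeight; // 60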
So, candy notwithstanding, what we have just demonstrated is a fundamental mechanism of the human brain: the ability to process data much more easily when it is categorized. Raw data suffers from the opposite condition. It is generally stored arbitrarily in tables, databases, or API results, and it is not categorized, sorted, or filtered in ways that lend themselves to human comprehension. It is up to those of us who are capturing this data and exposing it to users to tidy it up in a way that is easy to grok. We need to let the robots sort the M&Ms into the bowls by color.
So what, exactly, is the best way to accomplish this? Some languages, like SQL, Python, and R, are designed to handle data natively. In JavaScript we have more recently gained methods on the Array prototype, such as map and reduce, that help with this. These get convoluted quickly, however, and they don't offer a great way to categorize, filter, and sort data on the fly.
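To see what I mean, here is roughly what it takes to average one value by category with reduce and map alone, using our candy metaphor (the candies array and its weights are made up for illustration):

// Each element is one candy with a color and a weight in grams
var candies = [
  { color: "red", weight: 0.91 },
  { color: "blue", weight: 0.88 },
  { color: "red", weight: 0.93 }
];

// First pass: accumulate a sum and a count per color
var totals = candies.reduce(function (acc, candy) {
  var t = acc[candy.color] || (acc[candy.color] = { sum: 0, count: 0 });
  t.sum += candy.weight;
  t.count += 1;
  return acc;
}, {});

// Second pass: turn the sums and counts into averages
var averages = Object.keys(totals).map(function (color) {
  return { color: color, avgWeight: totals[color].sum / totals[color].count };
});
// [ { color: "red", avgWeight: 0.92 }, { color: "blue", avgWeight: 0.88 } ]

It works, but every new grouping or aggregation means rewriting this dance by hand.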
I have recently been working on a library called jschema.js, an API for working with datasets in JavaScript. It gets its name from the fact that it groups all of your data together in a schema (much like a database schema) so you can work with it much more easily. What we are going to look at here is how to use jschema.js to aggregate data. The very first thing we want to do is create a new schema, fetch a JSON dataset, and add it to our schema:
var s = new jSchema();
fetch("data/iris.json")
  .then(response => response.json())
  .then(json => s.add(json, {
    name: "iris",
    primaryKey: "id"
  }))
This is a raw dataset for identifying various species of iris flowers; in other words, our "big bowl of M&Ms." Once we have added this dataset to our schema, we can easily create aggregations of it by chaining another step onto the promise chain and calling the groupBy method:
  .then(function() {
    s.groupBy("IRIS", {
      dim: "SPECIES",
      metric: "PETALWIDTHCM",
      name: "SPECIES",
      method: "AVERAGE"
    });
  });
We are passing two parameters to the method: the name of the dataset in the schema and an options object. The options consist of the dimension (the characteristic we want to group by), the metric (the value we want to aggregate), the name (the name the output dataset will have in the schema), and the method (the way we wish to aggregate the metric).
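Because the call is driven entirely by these options, re-slicing the data is a matter of changing a field or two. For example, to average petal length instead of petal width (PETALLENGTHCM is another column in the Iris dataset; the output name here is just an illustration):

s.groupBy("IRIS", {
  dim: "SPECIES",
  metric: "PETALLENGTHCM",  // a different metric from the same dataset
  name: "SPECIES_LENGTH",   // hypothetical name for the new output dataset
  method: "AVERAGE"
});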
The original groupBy call creates a new dataset in our schema called SPECIES, containing the average petal width by species. The idea is that you can quickly and easily change the aggregation method, dimension, and metric to further analyze your data without resorting to a lot of convoluted Array manipulation. The output would look something like this:
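Roughly speaking, the new SPECIES dataset holds one row per species. The values here are the approximate average petal widths from the classic Fisher Iris data, and the column layout is an illustration rather than jschema's exact output:

SPECIES            PETALWIDTHCM
Iris-setosa        0.25
Iris-versicolor    1.33
Iris-virginica     2.03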

All in all, the idea is that aggregated data is much more valuable from a user's perspective, and it is useful to have a library that can generate these tables on the fly. More information on jschema.js can be found on GitHub and npm, along with demos (including the Iris demo I described above). So tear open a bag of M&Ms and start peeling away the layers of complexity in your datasets.
- BG