Finding Energy Trends with Time Series Data

This article was co-authored by Angela Bassa and Husain Al-Mohssen from EnerNOC’s data science team.

Energy data is only as valuable as you make it. With a combination of energy intelligence software (EIS), a strong analytics stack, and expert knowledge, businesses are not only amassing large quantities of data; they are also making informed decisions as a result.

At EnerNOC, we’ve been pursuing advancements in the insights we can get from time series data. With access to hundreds of thousands of time series, we face a substantial opportunity to explore high-impact energy consumption trends—assuming we can scale the project and deliver tools that make it easy to put this data to use.

When compared side by side, different kinds of time series data can reveal valuable insights into energy consumption and cost-reduction opportunities. For a large organization, tools that find similarities between time series could enable analysts and consultants to compare energy consumption with external factors such as regional temperature and stock market performance. The ability to uncover hidden patterns across different kinds of data can be very powerful, enabling analysts to create models and baselines, maintain data quality, and forecast future energy trends.

Say you’re a large organization with an array of solar panels; with the right tools, you can compare metering data with local sunshine and cloud-cover data. Naturally, you’ll see a correlation between the energy generated by those panels and cloud cover. Using that relationship as a baseline, you could look for fluctuations in energy generation that do not correlate with cloud cover, and find errors that may be caused by faulty meters. Follow this line of thinking across an entire organization and the opportunities multiply.
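The solar example can be sketched in a few lines of Python. This is a minimal illustration, not EnerNOC’s implementation: it assumes generation falls roughly linearly with cloud cover, fits that baseline by least squares, and flags readings whose residual sits far from the typical residual. The function names, the 0-to-1 cloud-cover scale, the sample readings, and the threshold of 3 robust standard deviations are all assumptions made for illustration.

```python
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least-squares fit of y ~ a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def flag_anomalies(cloud_cover, generation_kwh, threshold=3.0):
    """Return indices of readings that deviate strongly from the fitted baseline."""
    a, b = fit_line(cloud_cover, generation_kwh)
    residuals = [g - (a + b * c) for c, g in zip(cloud_cover, generation_kwh)]
    # Median and MAD give a spread estimate that one bad meter cannot inflate.
    med = median(residuals)
    mad = median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad  # MAD rescaled to be comparable to a standard deviation
    if scale == 0:
        scale = 1e-9  # avoid division by zero when residuals are identical
    return [i for i, r in enumerate(residuals) if abs(r - med) / scale > threshold]


# Hypothetical readings: generation drops with cloud cover, except reading 5,
# where the meter reports almost nothing under only moderate cloud.
cloud = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
kwh = [100, 92, 84, 76, 68, 5, 52, 44, 36, 28]
suspect = flag_anomalies(cloud, kwh)
print(suspect)  # [5]
```

A flagged index is only a prompt for investigation; a real deployment would also need to account for time of day, season, and panel degradation before blaming the meter.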

This year, EnerNOC sponsored a program with the data science graduate program at Worcester Polytechnic Institute aimed at turning this data into a useful tool for energy managers. Commenting on the project, WPI’s data science department said:

WPI is one of only a handful of universities that prepares graduates to work in the rapidly expanding field of Data Science. Students in the WPI Data Science master’s program work closely with faculty and peers to synthesize huge amounts of digital information from multiple sources, deriving new insights and articulating these findings into innovative solutions for how we live, work, and interact with the world around us.

Here’s an overview of how the hard work both teams put toward this effort has paid off.

What We Accomplished

Off the bat, we knew that scalability would be a principal concern. EnerNOC deals with a very large number of clients with multiple years’ worth of data that can be at the granularity of minutes or less. For the tool to be useful, it needs to accommodate not only the huge amounts of data that we have today, but also the even larger amounts that will be introduced as the company grows.

To this end, we chose the Apache Spark computing platform, which provides an environment that facilitates writing parallel, scalable code. EnerNOC uses the R programming language as one of the standard tools in our analytical stack. Code written in Scala can also execute R code using Rserve on the Spark platform, allowing us to leverage R throughout development.

To accomplish our goal, we needed to upload a significant amount of data, on the order of hundreds of thousands of time series. But because we built this from the ground up on the Spark platform, we could scale up to millions of time series: when we need to scale further, we can simply add more computing nodes to the Spark cluster. The additional cores would allow us to handle however many additional time series we needed.

For database scaling, we turned to the Apache Cassandra database management system. Cassandra is highly touted throughout the industry, and we found it lives up to the hype in terms of both performance and ease of use. It also features standard integration with Spark.

We also needed to design a user interface that would let engineers and energy analysts use the data easily. This involved an iterative process with many one-on-one meetings with members of the data science team interested in problems ranging from anomaly detection to baseline generation. The feedback from these meetings proved invaluable, helping clarify exactly how people would use the tool and informing our decisions when developing the interface.

Altogether, we were able not only to find similar time series, but also to present their inter-relationship in an actionable manner through the interface. Our hierarchical clustering approach allows us to visualize the site that is closest to the centroid of each cluster. This gives a sense of what the different “typical” sites look like for each of these clusters. Using the simple “average week” metric, we found confirmation of our qualitative understanding of the data.
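To make the idea of a “typical” site concrete, here is a minimal Python sketch of bottom-up hierarchical clustering followed by picking the site nearest each cluster’s centroid. It is an illustration under stated assumptions, not the tool’s actual code: the article does not specify the linkage rule or distance measure, so the sketch assumes Euclidean distance between centroids, and the function names and the seven-value daily profiles are hypothetical.

```python
from math import dist


def centroid(profiles):
    """Element-wise mean of a list of equal-length profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def agglomerate(profiles, k):
    """Merge the two clusters with the nearest centroids until k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        cents = [centroid([profiles[i] for i in c]) for c in clusters]
        # Find the closest pair of clusters (ties resolve to the lowest indices).
        _, a, b = min(
            (dist(cents[a], cents[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def representative(profiles, cluster):
    """Index of the member site whose profile is closest to the cluster centroid."""
    center = centroid([profiles[i] for i in cluster])
    return min(cluster, key=lambda i: dist(profiles[i], center))


# Hypothetical average daily load, Monday to Sunday, for six sites.
sites = [
    [10, 10, 10, 10, 10, 2, 2],   # sites 0-2: five-days-per-week pattern
    [11, 11, 11, 11, 11, 3, 3],
    [12, 12, 12, 12, 12, 4, 4],
    [10, 10, 10, 10, 10, 10, 10], # sites 3-5: seven-days-per-week pattern
    [11, 11, 11, 11, 11, 11, 11],
    [12, 12, 12, 12, 12, 12, 12],
]
clusters = agglomerate(sites, k=2)
typical = [representative(sites, c) for c in clusters]
print(clusters, typical)  # [[0, 1, 2], [3, 4, 5]] [1, 4]
```

Showing a real member site rather than the centroid itself keeps the visualization grounded in an actual meter; the pairwise search here is quadratic per merge, so it only suits small illustrations, not hundreds of thousands of series.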

Image removed.

A hypothetical example of the insights made available with the tool.

We learned something about the “average” sites using a simple weekly similarity metric. The average “composite” site, i.e. the centroid, has flat energy consumption during the week and slightly lower consumption on the weekends. This makes intuitive sense, as we’d expect lower weekend energy use in a portion of the sites in our dataset. In the hierarchy of the clustering, we can see sites “branch out” into different types. The visuals above show weekly consumption, with the five-days-per-week and seven-days-per-week “branches”.
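One plausible way to compute an “average week” profile is to average all readings that fall in the same hour of the same weekday. The Python sketch below assumes hourly slots and (timestamp, kW) pairs; the article does not state the actual resolution or data layout, so the function name, the slot size, and the synthetic readings are assumptions for illustration only.

```python
from datetime import datetime, timedelta

HOURS_PER_WEEK = 7 * 24


def average_week(readings):
    """Average (timestamp, kW) readings into 168 hourly slots, Monday 00:00 first.

    Slots with no readings are returned as None.
    """
    sums = [0.0] * HOURS_PER_WEEK
    counts = [0] * HOURS_PER_WEEK
    for timestamp, kw in readings:
        slot = timestamp.weekday() * 24 + timestamp.hour
        sums[slot] += kw
        counts[slot] += 1
    return [s / n if n else None for s, n in zip(sums, counts)]


# Synthetic two-week series: weekday load of 10 kW, then 12 kW; weekends at 4 kW.
start = datetime(2024, 1, 1)  # a Monday
readings = []
for h in range(2 * HOURS_PER_WEEK):
    ts = start + timedelta(hours=h)
    weekday_load = 10.0 if h < HOURS_PER_WEEK else 12.0
    readings.append((ts, weekday_load if ts.weekday() < 5 else 4.0))

profile = average_week(readings)
print(profile[0], profile[5 * 24])  # Monday 00:00 and Saturday 00:00 averages
```

Two sites can then be compared by the distance between their 168-value profiles, which is the kind of input a weekly similarity metric needs.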

In the process of designing and developing this analysis, the students learned a lot about how to design a tool for real-world applications. They saw how a project like this can be adjusted incrementally to make the best use of their implementation and the UI, and they deployed code that keeps the tool useful on different hardware and with different sources of data.