Using Big Query to Calculate Degree Days and K-Factor to Better Estimate Home Energy Usage

Cloud Jake
codeburst
Published in
12 min readSep 21, 2020

--

Google Sheets, Big Query, and Public Data Sets — Calculating Degree Days and K-Factor.

Car dashboard speedometer with focus on 99MPG fuel economy

Have you ever wondered how home heating fuel companies know when to make a delivery? Even if you don’t get propane or home heating oil deliveries, you can still estimate your fuel usage over time by using the same formula used by energy companies by leveraging public weather datasets in BigQuery and your past heating bills.

Much like tracking fuel economy (measured by miles per gallon) in your vehicle, energy companies use a formula that relies on historical weather data (measured in degree days) and a calculated value called the K-Factor — which represents the number of gallons of fuel burned per degree day (degree days per gallon). In this exercise, we’ll explore how to:

  • Calculate degree days for a period using public weather datasets
  • Calculate the K-Factor for your home by looking at past energy bills

Assemble Past Delivery Data

Even if you don’t track your heating fuel deliveries, you should able to download billing information from your provider. At a minimum, we’ll need:

  • Date of delivery
  • Number of gallons delivered
  • Cost per gallon (optional)

Let’s put this data in a Google Sheet so we can then use it in BigQuery.

I started tracking this information in 2011 and have been using a Google Sheet to record the information above as well as a few other simple calculations (such as cost per month, gallons per day, etc). While Google Sheets is great for these simple calculations, I’ve not been able to find an easy way to incorporate weather data and do complex calculations.

Query Google Sheet Data from BigQuery

Now that we’ve recorded delivery data in a Google Sheet, we’ll need to make the Sheet available in BigQuery by querying Google Drive data. We’ll create a new table in the home data dataset.

image of homedata BigQuery Dataset

Click CREATE TABLE.

BigQuery Create Table dialog

Under Source, click the drop-down to Create table from: and select Drive.

Create Table From: dialog

Paste in the Share Link from Google Sheets in the Select Drive URI field and select Google Sheets from the File format: dropdown. Complete the form by entering the following information:

  • Table Name: heating_oil
  • Schema: note Name, Type, and Mode in example above (for 3 columns in Google Sheet)
  • Advanced: Headers Rows to Skip: 1
View of heating_oil table schema

Let’s confirm that we can query the data from Google Drive:

SELECT * FROM `bq-jake.homedata.heating_oil`
Query result for home heating oil

Preparing Delivery Data

Since we’ll be calculating the number of days since the last delivery, let’s better organize the delivery dates so that we can calculate consumption and degree days between the current delivery and last delivery. To pair the last delivery date (previous row) with the current date, we’ll use the LAG function.

SELECT  delivery_date, 
LAG(delivery_date)
OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
example of LAG BigQuery function

To project out to today’s date from the last delivery, let’s add today’s date to the list. We’ll need this later when we calculate a delivery for “today”.

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date)
OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)
-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
Note: Today’s Date is 2020–09–20

Finally, let’s calculate the number of days between deliveries and add it as an additional column. We’ll use the DATE_DIFF function to calculate the number of days between deliveries and then the GENERATE_DATE_ARRAY function to list all of the dates excluding the next delivery date (removed with DATE_SUB):

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date) OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)
-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
,all_deliveries as (
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
)
SELECT delivery_date,
previous_delivery,
DATE_DIFF(delivery_date, previous_delivery, DAY) as CALC_days, -- Number of days between deliveries
GENERATE_DATE_ARRAY(previous_delivery,DATE_SUB(delivery_date, INTERVAL 1 DAY), INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries (exclude next delivery date)
from all_deliveries

We now see the additional column CALC_days with the number of days between deliveries. We also have an additional column named CALC_dates which is an array of dates starting at the previous delivery date going through the (day before the) delivery date (since it was excluded with DATE_SUB).

In order to join this data more easily, let’s flatten the array of dates between deliveries (CALC_dates) using the UNNEST command.

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date) OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)
-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
,all_deliveries as (
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
)
,calc_days as (
SELECT delivery_date,
previous_delivery,
DATE_DIFF(delivery_date, previous_delivery, DAY) as CALC_days, -- Number of days between deliveries
-- GENERATE_DATE_ARRAY(previous_delivery,delivery_date, INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries
GENERATE_DATE_ARRAY(previous_delivery,DATE_SUB(delivery_date, INTERVAL 1 DAY), INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries (exclude next delivery date)
from all_deliveries
)
-- unnest days by delivery date so we can join to weather data
SELECT delivery_date, previous_delivery, dates_flattened
from calc_days, UNNEST(CALC_dates) as dates_flattened

Find a nearby Weather Station

We’ll start by gathering historical weather data from a nearby weather station. I’ve outlined these steps in the following Medium article:

Finding the Closest Weather StationsBigQuery Public Datasets

Note the following identifiers:

  • stn — Station number (WMO/DATSAV3 number). Ex: 725037
  • wban — historical “Weather Bureau Air Force Navy” number. Ex: ’94745

Once you’ve identified a nearby weather station with quality data, we’ll need to construct a query to gather the daily mean temperature reading for the duration of our analysis. Since the historical weather data is a large dataset, broken across several large tables (by year), we’ll need to be careful about how and how often we query this data.

To query multiple years at once, we’ll take advantage of querying wildcard tables (tables of similar name and structure). For this query, we’ll only include the tables for the years 2011–2020 using the _TABLE_SUFFIX option in the WHERE clause:

SELECT CAST(CONCAT(year,'-',mo,'-',da) as DATE) as DATE,
temp
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '11'
AND '20'
AND stn = '725037'
AND wban = '94745'
ORDER BY DATE
BigQuery results for mean temperature by day

The query quickly returned the > 3500 rows (one value per day) that we were expecting, but the query processed 1.7GB of data. It’ll be very expensive to run this every time, so let’s store it as a temp table by clicking the SAVE RESULTS button, then BigQuery Table. I saved this data is a temp dataset as a table named gsod_dates

Save Query dialog box

Calculating Degree Days

A degree day is calculated by taking the mean temperature in a given day and subtracting from the base temperature of 65. For example, on a given winter day where the mean temperature for the day was 40 degrees, the number of degree days would be 65–40=25.

To complicate things a bit further, there is an adjustment for systems that also produce hot water. Since the mean temperature in summer months is typically above 65 (resulting in zero degree days), we need to compensate for hot water production from the heating system on those days. In the CASE statement below we compensate for hot water production based on a published conversation chart.

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date) OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)
-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
,all_deliveries as (
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
)
,calc_days as (
SELECT delivery_date,
previous_delivery,
DATE_DIFF(delivery_date, previous_delivery, DAY) as CALC_days, -- Number of days between deliveries
-- GENERATE_DATE_ARRAY(previous_delivery,delivery_date, INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries
GENERATE_DATE_ARRAY(previous_delivery,DATE_SUB(delivery_date, INTERVAL 1 DAY), INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries (exclude next delivery date)
from all_deliveries
)
,flatten_dates as (
-- unnest days by delivery date so we can join to weather data
SELECT delivery_date, previous_delivery, dates_flattened
from calc_days, UNNEST(CALC_dates) as dates_flattened
)
,GSOD_dates as (
SELECT * FROM `bq-jake.temp.gsod_dates`
)
SELECT
delivery_date,
previous_delivery,
CAST(temp as Numeric) as mean_temp,
CASE
WHEN 65-CAST(temp as Numeric) > 0 THEN 65-CAST(temp as Numeric)
ELSE 0
END as CALC_degree_days,
CAST(CASE
WHEN CAST(temp as Numeric) >= 62 THEN 6
WHEN CAST(temp as Numeric) >= 58 THEN 5
WHEN CAST(temp as Numeric) >= 54 THEN 4
WHEN CAST(temp as Numeric) >= 50 THEN 3
WHEN CAST(temp as Numeric) >= 46 THEN 2
WHEN CAST(temp as Numeric) >= 43 THEN 1
ELSE 0.0
END as Numeric)as CALC_hot_water
-- http://www.degreeday.com/faqs.aspx
FROM flatten_dates
LEFT JOIN GSOD_dates
ON DATE = dates_flattened
Results of expanded BQ SQL to calculate Degree Days and Hot Water compensation

Let’s wrap up this data by combining the calculated degree days and compensation for hot water production and then summing across delivery dates.

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date) OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
,all_deliveries as (
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
)
,calc_days as (
SELECT delivery_date,
previous_delivery,
DATE_DIFF(delivery_date, previous_delivery, DAY) as CALC_days, -- Number of days between deliveries
-- GENERATE_DATE_ARRAY(previous_delivery,delivery_date, INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries
GENERATE_DATE_ARRAY(previous_delivery,DATE_SUB(delivery_date, INTERVAL 1 DAY), INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries (exclude next delivery date)
from all_deliveries
)
,flatten_dates as (
-- unnest days by delivery date so we can join to weather data
SELECT delivery_date, previous_delivery, dates_flattened
from calc_days, UNNEST(CALC_dates) as dates_flattened
)
,GSOD_dates as (
SELECT * FROM `bq-jake.temp.gsod_dates`
)
,get_degree_days as (
SELECT
delivery_date,
previous_delivery,
CAST(temp as Numeric) as mean_temp,
CASE
WHEN 65-CAST(temp as Numeric) > 0 THEN 65-CAST(temp as Numeric)
ELSE 0
END as CALC_degree_days,
CAST(CASE
WHEN CAST(temp as Numeric) >= 62 THEN 6
WHEN CAST(temp as Numeric) >= 58 THEN 5
WHEN CAST(temp as Numeric) >= 54 THEN 4
WHEN CAST(temp as Numeric) >= 50 THEN 3
WHEN CAST(temp as Numeric) >= 46 THEN 2
WHEN CAST(temp as Numeric) >= 43 THEN 1
ELSE 0.0
END as Numeric)as CALC_hot_water
-- http://www.degreeday.com/faqs.aspx
FROM flatten_dates
LEFT JOIN GSOD_dates
ON DATE = dates_flattened
)
SELECT delivery_date, previous_delivery, SUM(CALC_degree_days + CALC_hot_water) as degree_days
from get_degree_days
GROUP BY delivery_date, previous_delivery
BigQuery results

Calculating the K-Factor

Now that we’ve summed the degree days between deliveries, we need to add back in the gallons delivered (from our Google Sheet brought into BigQuery). To determine the K-Factor for a given delivery, all we need to do is divide the number of degree days since the last delivery by the number of gallons delivered: degree_days/gallons_delivered.

Additionally, we’ll sneak in today’s date between the last delivery in the CASE statement.

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date) OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)
-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
,all_deliveries as (
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
)
,calc_days as (
SELECT delivery_date,
previous_delivery,
DATE_DIFF(delivery_date, previous_delivery, DAY) as CALC_days, -- Number of days between deliveries
-- GENERATE_DATE_ARRAY(previous_delivery,delivery_date, INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries
GENERATE_DATE_ARRAY(previous_delivery,DATE_SUB(delivery_date, INTERVAL 1 DAY), INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries (exclude next delivery date)
from all_deliveries
)
,GSOD_dates as (
SELECT * FROM `bq-jake.temp.gsod_dates`
)
,get_degree_days as (
SELECT
delivery_date,
previous_delivery,
CAST(temp as Numeric) as mean_temp,
CASE
WHEN 65-CAST(temp as Numeric) > 0 THEN 65-CAST(temp as Numeric)
ELSE 0
END as CALC_degree_days,
CAST(CASE
WHEN CAST(temp as Numeric) >= 62 THEN 6
WHEN CAST(temp as Numeric) >= 58 THEN 5
WHEN CAST(temp as Numeric) >= 54 THEN 4
WHEN CAST(temp as Numeric) >= 50 THEN 3
WHEN CAST(temp as Numeric) >= 46 THEN 2
WHEN CAST(temp as Numeric) >= 43 THEN 1
ELSE 0.0
END as Numeric)as CALC_hot_water
-- http://www.degreeday.com/faqs.aspx
FROM flatten_dates
LEFT JOIN GSOD_dates
ON DATE = dates_flattened
)
,sum_degree_days as (
SELECT delivery_date, previous_delivery, SUM(CALC_degree_days + CALC_hot_water) as degree_days
from get_degree_days
GROUP BY delivery_date, previous_delivery
)
SELECT CASE
WHEN a.delivery_date is NULL THEN CURRENT_DATE('America/New_York') -- so we can estimate for today's date
ELSE a.delivery_date
END as delivery_date,
a.gallons,a.cost_per_gallon,
c.previous_delivery,
b.degree_days,
degree_days/gallons as Kfactor
FROM sum_degree_days b, all_deliveries c
LEFT JOIN `bq-jake.homedata.heating_oil` a
ON a.delivery_date = b.delivery_date
WHERE c.previous_delivery = b.previous_delivery
BigQuery results

In order to project “today’s” delivery, we’ll calculate a mean K-Factor based on our delivery and usage history. We’ll simply sum all of the K-Factors that we’ve calculated per deliver and divide by the number of deliveries to estimate a mean K-Factor to associate with “today’s” delivery.

As a bonus, we’ll create a comments field to explain what’s happening in each row.

with get_deliveries as (
SELECT delivery_date,
LAG(delivery_date) OVER (ORDER BY delivery_date) as previous_delivery,
FROM `bq-jake.homedata.heating_oil`
WHERE delivery_date is not null
ORDER BY delivery_date
)
-- append Today's date to end of list so we can estimate for today
,append_today as (
SELECT CURRENT_DATE('America/New_York') as delivery_date,
max(delivery_date) as previous_delivery
from get_deliveries
)
,all_deliveries as (
SELECT * FROM get_deliveries
UNION ALL
SELECT * FROM append_today
)
,calc_days as (
SELECT delivery_date,
previous_delivery,
DATE_DIFF(delivery_date, previous_delivery, DAY) as CALC_days, -- Number of days between deliveries
-- GENERATE_DATE_ARRAY(previous_delivery,delivery_date, INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries
GENERATE_DATE_ARRAY(previous_delivery,DATE_SUB(delivery_date, INTERVAL 1 DAY), INTERVAL 1 DAY) as CALC_dates, -- list of dates between delivieries (exclude next delivery date)
from all_deliveries
)
,flatten_dates as (
-- unnest days by delivery date so we can join to weather data
SELECT delivery_date, previous_delivery, dates_flattened
from calc_days, UNNEST(CALC_dates) as dates_flattened
)
,GSOD_dates as (
SELECT * FROM `bq-jake.temp.gsod_dates`
)
,get_degree_days as (
SELECT
delivery_date,
previous_delivery,
CAST(temp as Numeric) as mean_temp,
CASE
WHEN 65-CAST(temp as Numeric) > 0 THEN 65-CAST(temp as Numeric)
ELSE 0
END as CALC_degree_days,
CAST(CASE
WHEN CAST(temp as Numeric) >= 62 THEN 6
WHEN CAST(temp as Numeric) >= 58 THEN 5
WHEN CAST(temp as Numeric) >= 54 THEN 4
WHEN CAST(temp as Numeric) >= 50 THEN 3
WHEN CAST(temp as Numeric) >= 46 THEN 2
WHEN CAST(temp as Numeric) >= 43 THEN 1
ELSE 0.0
END as Numeric)as CALC_hot_water
-- http://www.degreeday.com/faqs.aspx
FROM flatten_dates
LEFT JOIN GSOD_dates
ON DATE = dates_flattened
)
,sum_degree_days as (
SELECT delivery_date, previous_delivery, SUM(CALC_degree_days + CALC_hot_water) as degree_days
from get_degree_days
GROUP BY delivery_date, previous_delivery
)
,all_history as (
SELECT CASE
WHEN a.delivery_date is NULL THEN CURRENT_DATE('America/New_York') -- so we can estimate for today's date
ELSE a.delivery_date
END as delivery_date,
a.* EXCEPT(delivery_date),
c.previous_delivery,
b.* EXCEPT(delivery_date, previous_delivery),
degree_days/gallons as Kfactor
FROM sum_degree_days b, all_deliveries c
LEFT JOIN `bq-jake.homedata.heating_oil` a
ON a.delivery_date = b.delivery_date
WHERE c.previous_delivery = b.previous_delivery
)
,calc_kfactor as (
SELECT CURRENT_DATE('America/New_York') as DATE,
SUM(Kfactor) / COUNT(Kfactor) as mean_Kfactor
from all_history
)
,put_it_all_together as (
SELECT delivery_date,
previous_delivery,
CASE
WHEN gallons is not null THEN gallons
ELSE degree_days / mean_Kfactor
END as gallons,
cost_per_gallon,
degree_days,
CASE
WHEN Kfactor > 0 THEN Kfactor
WHEN Kfactor is null THEN mean_Kfactor
ELSE null
END as Kfactor,
CASE
WHEN company is null THEN CONCAT('Kfactor estimated from mean Kfactor, gallons calculated to be delivered on ', delivery_date)
ELSE CONCAT(gallons,' gallons were delivered on ',delivery_date,' with ',degree_days,' degree days since last delivery on ',previous_delivery)
END as Comments
FROM all_history
LEFT JOIN calc_kfactor
ON DATE = delivery_date
)
SELECT * FROM put_it_all_together
ORDER BY delivery_date DESC

Conclusion

And there you have it — a guide to estimating your fuel usage over time! Now, the next time the fuel oil delivery truck shows up, I can wow my delivery driver by accurately predicting the number of gallons he or she is about to deliver!

Full code on GitHub.

--

--