

Foreword

Embarking on the journey of collecting data for your on-premise resources can at first seem like an impossible task. What data do you actually need? Do you have the right tooling? Which stakeholders will you need, and how much time is this going to take? Where do you store the data? Most importantly, how do you get the data once you identify the tooling available to you? We'll cover all of that below, but first know that despite your initial trepidation, endpoint data exists in countless tools. With a little creativity and know-how, you should have no trouble identifying, retrieving, and storing endpoint data to be utilized within the CCF Dashboard.

Getting Started With On-Premise Data Collection

The On-premise data model

The CCF CLI takes in on-premise data using a predefined model. This model dictates the type, content, and format of the data needed to facilitate a proper and accurate data import and manipulation. You can view the current custom data model, as well as methodology write-ups, in the On-Premise sections of the CCF docs. The data model contains many common fields such as CPU cores, memory, and where in the world a particular machine is located. We'll cover these in more detail as we continue, but keep in mind that all of this data is needed for each endpoint to provide accurate estimates.

Data Sources

Your organization will most likely contain a multitude of sources from which endpoint data can originate: computer accounts in Active Directory, system information stored within tools such as your antivirus, your patching tools, and more. It's in these places you'll be looking for your target endpoints and pulling relevant data to populate the On-Premise data model.

Endpoint data can also come from some unlikely places. If you find yourself lost in your search for endpoint data, check whether your IT department uses any of the following tools:

  • Antivirus suites
  • Vulnerability management suites
  • CMDB applications (Inventory management)
  • Configuration management suites
    • Puppet
    • Chef
    • SCCM
    • MEM
  • System monitoring suites
    • Nagios
    • Cacti
    • Grafana
      • Node Exporter
      • Telegraf
      • Prometheus Databases

The list goes on and on. There are countless tools currently available for organizations to perform a wide range of tasks related to endpoint management, monitoring, and protection. The wonderful thing about such tooling is that it also relies on large amounts of data about the endpoint. Antivirus tools need to know if the machine has been online and communicating with the management console for updates and status information. Monitoring tools rely solely on doing exactly what their name suggests: monitoring endpoints. CMDB tools are directly responsible for endpoint and inventory management, and many IT departments consider CMDB tooling a source of endpoint and inventory truth, so data here is likely to be very accurate.

NOTE: Keep in mind the permissions needed to access these data sources, and ensure you’re following your company's best practices on the retrieval and storage of this data. Where possible, always partner with the data source owner to ensure you’re not only getting the best data, but handling it in a manner that doesn’t expose your company to unnecessary risks or security incidents.

Data sources are the most important part of embarking on a journey of tracking and estimating the environmental impacts of your enterprise. As you begin, be creative and leverage your existing partnerships with various departments within your organization to find these unlikely sources of data. Continue to rely on the On-premise Data Model to match data sources with the required fields.

When choosing a data source, be mindful of the size of the dataset and your ability to continually pull data out. Services that are intermittent or have unreliable uptime are not preferable sources, and data sources with large datasets that cannot be filtered or compressed may prove difficult or costly to ingest. Where possible, choose data sources that expose APIs; this will make automating and scheduling data collection far simpler in the long run. If APIs are simply not available, strive instead for tooling that can generate scheduled reports containing the data you need, ideally in CSV, JSON, or XML outputs that you can use easily with your data collection and storage processes. We'll cover this in the section Collecting on-premise data.

Before we move on to the means and methods by which you might collect and store this data, bear in mind that some of the fields in the On-premise Data Model are to be collected over time. We will go into greater detail about these fields later on. Just understand that your chosen data source does not necessarily have to include this data; you may be able to generate it with good automation and scheduled data retrieval from other information provided by the data source.

  • dailyUptime
  • weeklyUptime
  • monthlyUptime
  • annualUptime

Once you have a source of data to utilize, you’ll be using that data to calculate the above fields.

Data Collection and Storage Technologies

For example's sake, let's say you have chosen your antivirus as your initial data source. This tool contains all the basic information needed for the On-premise Data Model, and it has a robust REST API that can be used to fetch the data. Where do we go from here? Let's have a look at a basic workflow diagram.

Workflow Diagram

This is the simplest possible way to represent the actions to be taken when it comes to collecting and storing data. Although simple to visualize, the means by which you achieve this goal can grow from very simple to very complex depending on the amount of data you will have, the length of time you'll be holding onto it, and any security or risk considerations being taken into account. As with data sources, there are countless tools and techniques one can use to perform these actions.

Because we'll need to capture this data over time, choose a data collection technology that allows for accurate and timely scheduled runs. Being able to collect data daily, or even hourly, will greatly improve the accuracy of your data.

AWS Glue with AWS S3 Data Lake

This is an excellent cloud-native approach to capturing large to extremely large amounts of data easily. AWS Glue provides capabilities such as scheduled and interval-based runs, and the S3 data lake can handle extremely large amounts of data.

Additionally, you can pair this with AWS services such as Athena, RDS, and Lambda to make transforming and retrieving the data simple, effective, and accurate.
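As a rough illustration, here is a minimal sketch of querying endpoint records that a Glue crawler has catalogued from the S3 data lake into an Athena table. The database, table, and bucket names here are assumptions used for illustration only.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Assumed names: a Glue crawler has catalogued the raw endpoint exports
# from the S3 data lake into an "endpoint_data" database with an "endpoints" table.
response = athena.start_query_execution(
    QueryString="SELECT machinename, machinetype, memory FROM endpoints LIMIT 10",
    QueryExecutionContext={"Database": "endpoint_data"},
    ResultConfiguration={"OutputLocation": "s3://my-ccf-data-lake/athena-results/"},
)

# The query runs asynchronously; results land in the S3 output location above.
print(response["QueryExecutionId"])
```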

Django with Django-Rest-Framework, Celery-Beat, and PostgreSQL

If you have a Python developer available and are interested in a cost-effective or free solution, open source may be your best choice. Django is an extremely robust web framework built on the Model-View-Template pattern. It contains an impressive number of built-in capabilities, including CRUD functionality that makes creating and using database models a breeze.

With two additional packages it can provide all the data storage and retrieval functionality you would need. Django-Rest-Framework makes creating APIs within Django simple, thus aiding in retrieving your stored data. Celery-Beat is a database-backed task scheduler that can make scheduling accurate and timely data collection a breeze.

Django also works incredibly well within containerized environments such as Docker and Kubernetes.
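As a minimal sketch of this approach (the model, app layout, and schedule below are assumptions; only the field names follow the On-Premise Data Model), an endpoint record and an hourly collection schedule might look like this:

```python
# models.py -- a hypothetical Django model mirroring the On-Premise Data Model fields
from django.db import models

class Endpoint(models.Model):
    machine_name = models.CharField(max_length=255, unique=True)
    machine_type = models.CharField(max_length=16)  # "server", "laptop", or "desktop"
    cpu_description = models.CharField(max_length=255, blank=True)
    memory = models.PositiveIntegerField(help_text="Installed memory in GB")
    daily_uptime = models.FloatField(default=0)
    weekly_uptime = models.FloatField(default=0)
    monthly_uptime = models.FloatField(default=0)
    annual_uptime = models.FloatField(default=0)

# celery.py -- schedule an hourly collection task with Celery Beat
from celery import Celery
from celery.schedules import crontab

app = Celery("ccf_collector")
app.conf.beat_schedule = {
    "collect-endpoint-data": {
        "task": "collector.tasks.collect_endpoint_data",  # hypothetical task path
        "schedule": crontab(minute=0),  # run at the top of every hour
    },
}
```

Django-Rest-Framework can then expose these stored records over an API for later retrieval.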

On-premise Data - Data Model

Much of the data you'll be collecting is very straightforward; machine name, CPU type, and memory count may be readily available. Some of the data, however, requires special consideration. Let's explore those briefly.

Special Data Model Fields

The data points below require special care during collection. Build these considerations into your data model to keep the collection process accurate; a short sketch after the list shows one way to handle them.

  • machineType:
    • Different machine types will generate different power loads. The formula used in calculating power draw for different types of machines will vary based on the type. If your data source includes a type, you should ensure that it is formatted to be either “server”, “laptop”, or “desktop”. Being unable to provide this information will impact the accuracy of your power usage calculations.
  • cpuUtilization:
    • Your data collection tool may be unable to provide this information. If that is the case, your best option is to provide the most reasonable estimate you can.
  • [daily,weekly,monthly,annual]Uptime:
    • These four fields are the most important and also require a good data collection process. Using your data sources, you'll need to provide incremental totals for these fields using uptime data where available. Not being able to provide this data will lead to very inaccurate carbon impact estimations.
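Here is a minimal sketch of normalizing these special fields during collection. The raw source values and the fallback utilization figure are assumptions; adjust them to whatever your data source actually reports.

```python
# Hypothetical helper for normalizing special fields before they are stored
MACHINE_TYPE_MAP = {
    "srv": "server", "server": "server",
    "nb": "laptop", "notebook": "laptop", "laptop": "laptop",
    "ws": "desktop", "workstation": "desktop", "desktop": "desktop",
}

DEFAULT_CPU_UTILIZATION = 50.0  # assumed fallback when the source has no utilization metric

def normalize_special_fields(record: dict) -> dict:
    """Map a raw source record onto the machineType and cpuUtilization fields."""
    raw_type = str(record.get("machineType", "")).strip().lower()
    return {
        "machineType": MACHINE_TYPE_MAP.get(raw_type, "desktop"),
        "cpuUtilization": record.get("cpuUtilization", DEFAULT_CPU_UTILIZATION),
    }
```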

Collecting On-premise Data

Once you've identified a suitable data source and method, you can begin the process of collecting and storing your on-premise data. For the purposes of this write-up, we're going to use an antivirus suite as our data source.

The Data Source - Antivirus

Our data source is an antivirus suite that provides up-to-the-minute endpoint stats covering all of the basic fields.

  • cpuDescription
  • memory
  • machineType (Server, Laptop, Desktop)
  • machineName

In addition to the basics, the data source also has some additional data which will prove useful later on.

  • lastAgentCommunication
    • This tells us when the endpoint was last online. It will be useful in determining system uptime.
  • agentPublicIP
    • Knowing the country and region where an endpoint resides helps us determine its carbon intensity. The public IP can be used to look up the geolocation data associated with the agent to put this into practice.

Collecting the data

To make things easier for us, our data source also provides a REST API from which we can collect the data in real time. If your data source doesn't provide an API, try to get regular reports in CSV, XML, or JSON format that you can use in lieu of making API requests.

Depending on the method of data collection and storage you've chosen, some coding is most likely going to be required. It would be extremely beneficial to enlist the help of a developer or data engineer to assist you in reliably collecting and storing this data. Having the ability to retrieve and parse data from an API, work with S3 or a simple database, and create basic automations will come in handy throughout this process. Recall the workflow illustrated earlier.

Workflow Diagram

Whatever method you've chosen for data collection, ensure that you're able to perform the collection at a set, timed interval. Tracking machine uptime is paramount to being successful here.
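To make this concrete, here is a minimal sketch of a scheduled collection run against a hypothetical antivirus REST API. The URL, authentication scheme, and response field names are assumptions; substitute whatever your vendor's API actually documents.

```python
import requests
from datetime import datetime, timezone

API_URL = "https://antivirus.example.com/api/v1/endpoints"  # hypothetical endpoint
API_TOKEN = "..."  # load from a secret store; never hard-code credentials

def collect_endpoints() -> list[dict]:
    """Fetch endpoint records and map them onto On-Premise Data Model fields."""
    response = requests.get(
        API_URL, headers={"Authorization": f"Bearer {API_TOKEN}"}, timeout=30
    )
    response.raise_for_status()

    records = []
    for item in response.json():
        records.append({
            "machineName": item["machineName"],
            "machineType": item["machineType"].lower(),  # "server", "laptop", or "desktop"
            "cpuDescription": item["cpuDescription"],
            "memory": item["memory"],
            "lastAgentCommunication": item["lastAgentCommunication"],
            "agentPublicIP": item["agentPublicIP"],  # later resolved to a country/region
            "collectedAt": datetime.now(timezone.utc).isoformat(),
        })
    return records
```

A scheduler such as Celery-Beat or AWS Glue, as discussed above, can then run this collection at the interval you've chosen.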

Calculating uptime hours

A number of fields in the On-premise Data Model represent the uptime hours of your agent over a period of time. As you collect uptime information about your endpoints, you will want to populate these fields and increment each individual counter.

Each uptime counter represents a historical period of time: daily, weekly, yearly, as well as perhaps 30-, 60-, or 90-day increments. These fields may be present in your data source from day one, but you may need to create and calculate them yourself.

Uptime Diagram

The diagram above shows an oversimplified view of a data collection event for a single endpoint. Let's say, for instance, that the counter you're incrementing is the dailyUptime counter. As data about the endpoint comes in, it's determined that the endpoint has been online within the last hour. To increment the daily counter, we first check the timestamp of when the counter was last reset. If the counter is less than 24 hours old, we increment it by one hour. If it is older than 24 hours, we reset the counter to 1 and also reset the timestamp to the current date and time.

Here is a quick sample of pseudo code to illustrate a couple of these use cases.

Increment Daily Uptime Counters

Daily Uptime Code Snippet

Increment 30 Day Uptime Counters

Monthly Uptime Code Snippet
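In the same spirit as the snippets above, here is a minimal Python sketch of this counter logic, parameterized by window length so it covers both the daily and the 30-day case. The record structure and field names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def increment_uptime(record: dict, counter: str, reset_field: str, window_hours: int) -> None:
    """Add one hour of uptime, resetting the counter when its window has elapsed."""
    now = datetime.now(timezone.utc)
    last_reset = record.get(reset_field)

    if last_reset is None or now - last_reset >= timedelta(hours=window_hours):
        record[counter] = 1        # window elapsed (or first run): restart the counter
        record[reset_field] = now  # start a fresh window from now
    else:
        record[counter] = record.get(counter, 0) + 1

# Usage: called once per hourly collection run when the endpoint was seen online
endpoint = {"machineName": "host-01"}
increment_uptime(endpoint, "dailyUptime", "dailyResetAt", window_hours=24)
increment_uptime(endpoint, "thirtyDayUptime", "thirtyDayResetAt", window_hours=24 * 30)
```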

This same principle applies to all other values related to uptime: weeklyUptime, monthlyUptime, and annualUptime can all be calculated this way. You can also add additional uptime counters as you see fit; however, ensure that the required uptime fields from the On-premise Data Model are present.

When creating the initial timestamps, always start the timestamp from the moment the machine was first added to the data model. It is not advisable to create an arbitrary initial timestamp, as this can cause your uptime fields to be wildly inaccurate.

Conclusion

Collecting on-premise data for activities such as patching, inventory, and lifecycle management has been happening across IT organizations for a very long time, and an incredible amount of work has gone into the development and deployment of tooling to achieve those goals. As the world moves towards implementing green initiatives to better shape its technology future, the need to calculate and report on the environmental impact of our infrastructure grows as well.

Even though you may not have a fit-for-purpose tool to collect on-premise data, that doesn't mean you can't still get the data you need. This data most likely already exists in other tools in use by your organization. The antivirus used in this post is clearly not meant for this task, but with a bit of trial and error, it can do exactly what is needed to gather all of the necessary data. Be creative and keep an open mind. Data exists everywhere.

This paper represents an exploratory project undertaken by EA employees to explore ways to leverage existing data and automate methods to calculate electricity use and emissions associated with on-premise endpoints. The statements and opinions expressed in this article are those of the author and do not represent how EA calculates emissions and do not constitute or imply an endorsement of a product, process or service.

Arik Smith

Since the launch of Cloud Carbon Footprint (CCF), our team has always aimed to maintain and develop flexibility in both how the tool is used and the options for its deployment. With such a large variety of infrastructure across organizations, we strive to balance supporting as many environments out of the box as possible while also preserving the customization options you may need.

One of the most popular options, both for piloting a CCF instance as a proof of concept and for an internal production environment, is to run CCF within a virtual machine. After all, it is a great way to get your feet wet with CCF without bringing extra dependencies onto your local machine or tying up its resources, while enabling an always-on service that extends beyond your local machine. Alternatively, you may want to get creative by customizing or automating how CCF fits into your use cases. For example, perhaps you want to set up a cloud function that automatically fetches cloud estimates for every recent day, week, or month? Or maybe you want a fully accessible CCF API for members of your organization or team to query? If any of these benefits sound appealing, then the VM option might be for you, and you're in the right place!

Virtual machines come in many shapes and sizes across different cloud provider platforms, including AWS EC2, Google Cloud Compute Engine, and Azure Virtual Machine services. For the sake of this article, we’ll be focusing on how to create a CCF application running on an AWS EC2 instance. However, depending on the chosen operating system or distribution of your machine, the steps should be relatively the same!

Creating your EC2 Instance

CCF at its core is a Node application running Express.JS for its API and React for its client. So once you have the environment set up for one endpoint of the application, you'll be able to run instances of CCF's API, CLI, and Client on the same machine, and switch between them if desired. Before we do that, let's create our machine and set up our environment.

To start, we assume that you already have the following:

  • An existing AWS account with permissions for creating and managing EC2 instances
  • Followed the steps for setting up billing data for your cloud provider (i.e. AWS steps 1-3)
  • Basic familiarity with navigating the AWS console

Let’s navigate to the EC2 dashboard and select the option to launch a new instance using the shiny “Launch Instances” button:

Launch Instances Button

From this point, we’ll be selecting the configuration options for our new machine. While the free tier may be tempting, we’ll go with a t2.medium. I’ve found that on smaller instances such as a t2.micro, the limited hardware can sometimes cause issues when installing node modules or running the app. So we could use the extra “oomph”. However, for the sake of your own instance, please consider the following:

  • If you’re expecting a large amount of estimates, consider a larger instance with higher memory and compute power.
  • If you plan on running a MongoDB instance on the same machine and persisting a large amount of estimates, consider increasing the storage capacity of your instance.
  • If costs and efficiency are top of mind, consider choosing an ARM instance, as you will not be able to migrate from a non-ARM instance afterwards.
  • Running EC2 instances incur costs! So make sure to stop/delete instances when done with them and that you choose a capable instance that fits within your budget.

Launch Instances Config

You may also notice that we'll be going with an Ubuntu image as our operating system, a popular and widely accepted distribution that you can use with any VM host. You're welcome to choose a different Linux-based operating system, such as Amazon Linux in the case of an EC2 instance. Amazon Linux is a distribution optimized for running in AWS and for running most Linux-based software, making the steps you'll follow almost identical. For the sake of keeping this tutorial a little more friendly for other cloud VM services, we'll be sticking with Ubuntu.

After making some final decisions in creating a key pair (required for connecting via an SSH client) and choosing a security group, we’re going to hit the even more shiny “Launch Instance” button. After a short wait, you should see a notification that the instance has been created and is running. So let’s connect to it!

Side Note: If you’re more comfortable with the cloud provider CLI or other ways of configuring resources, these steps can be replicated using those methods as well.

Setting Up Your CCF Instance

Connect to your instance using your preferred method – whether it be in the cloud provider console or through a local terminal via SSH. Once you’re in, we will need to configure NPM and Node so that our server can support CCF’s code.

Installing Node.JS

We'll be following the officially recommended steps for setting up Node.JS on an EC2 instance. To do so, we will rely on NVM to make this an easy process! NVM will manage our versions of Node for us, making migrating or downgrading easier.

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.5/install.sh | bash

We then need to activate NVM by running the following command:

. ~/.nvm/nvm.sh

You can verify that it is active by running nvm -v, which should output 0.39.5 or similar.

CCF requires Node.js 16 or later; I prefer 18 since it is the current LTS release. Run the following command, replacing the number with the version that you wish to install:

nvm install 18

Note: Make sure to check the latest version compatibility that your image supports. For example, Amazon Linux 2 does not support Node 18 at the time of writing this article.

NVM will also inform you that it is setting the current version as the default. This is useful for whenever you need to switch Node versions or have Node disabled for some reason, as you can use nvm use default to re-enable it. You can verify that Node is successfully installed by running node -v, which should output the version that you installed.

Installing Yarn

While installing Node also automatically installs npm, CCF uses Yarn as its package manager. Fortunately, Node 16 and up make it super easy to install thanks to their inclusion of Corepack.

To install Yarn, we simply need to enable corepack using the following command:

corepack enable

With a little magic, you can now use yarn -v to verify that Yarn is installed and enabled on your server. You'll see that version 1 (Yarn Classic) is enabled. If you've been reading the CCF documentation, you may notice that it requires Yarn 3. Do not fear: CCF will take care of the upgrade during its installation.

Now for the fun part!

Installing CCF

There are multiple ways to install CCF, as noted in the Getting Started section. While you can clone the app and get full access to all of its packages and latest features as soon as they are available, we’re going to go with the Create App option as the more stable and simple solution.

To get started, we will run the create-app command with an additional flag to note that we wish to skip the yarn install step. This gives us flexibility in case we wish to connect our data using the guided install process, which will end up doing the installation step for us:

npx @cloud-carbon-footprint/create-app --skip-install

You'll be asked to install the latest version of the create-app package, to which you can reply "yes".

Name Create App

When prompted for a name, feel free to choose whichever name you prefer your app to have. In this example we'll be going with my-ccf-app. Take note that this will also be the name of the directory that the app will be created in, so you may want to make sure the name is unique and matches any folder naming conventions you may have.

Afterwards, you’ll see that the script takes care of creating your files and moving them to a directory named after your app. Your app will be successfully created!

Create App Successful

Use cd my-ccf-app to switch to the directory of your app. You should see the following contents within your directory:

lerna.json  package.json  packages  tsconfig.json

Note: If you run yarn -v again while in the directory, you’ll see that the version has automatically updated to 3.1.1 like magic. ✨

At this point, you can connect your data by manually creating a .env file in either your packages/api or packages/cli directory, based on the .env.template files in those same directories. Make sure to check out the documentation on how to connect data for your chosen cloud provider.

Alternatively, you can use the guided installation method mentioned above to run a friendly CLI program that will walk you through setting up your credentials and will create the .env files for you!

If you’d rather skip connecting your data altogether, you can also run with mock data instead and move on to the next step.

Running Your App

After following the setup method of your choice, let’s top things off by doing a yarn install. Once the dependencies are done installing, we’re good to start our app!

You can now use yarn start to concurrently start both the Client (react dashboard) and the API (express app).

  • If running the client, or when using mock data, your CCF Dashboard will be available on port 3000 of your instance. You can view the dashboard by navigating to the public IP of your instance followed by the corresponding port.
    • Please note, you will need to configure your security or network settings to make this port available
  • You can also use yarn start-api instead to only run the API. You can verify that the API is running by making a request to one of the endpoints on port 4000 of the instance.
    • For example, try making the following command in another terminal instance:
      `curl http://[your-ip]:4000/api/regions/emissions-factors`
    • Please note, you will need to configure your security or network settings to make this port available if attempting to make requests outside of the instance.
  • You can also use yarn start-cli for running the CLI and requesting estimates directly within the terminal.

Congratulations! You now have successfully created a CCF app running in a virtual machine.
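If you've exposed the API port, you can also query it programmatically from another machine. Here is a minimal sketch that hits the emissions-factors endpoint used in the curl example above; the placeholder IP is an assumption you should replace with your instance's public address.

```python
import requests

INSTANCE_IP = "203.0.113.10"  # placeholder: replace with your instance's public IP

# Same endpoint as the curl example above, queried from Python instead
response = requests.get(
    f"http://{INSTANCE_IP}:4000/api/regions/emissions-factors", timeout=30
)
response.raise_for_status()

# Print the first few emissions factors returned by the API
for factor in response.json()[:5]:
    print(factor)
```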

You may notice that if you exit the SSH or terminal session, the running process will not persist. In this case, you can use a tool such as Screen to create a persistent terminal session to run your app in. To do so, try running the following command:

screen -S ccf

This will create a new Screen session called “ccf”. From here, you can run one of the yarn start commands to run your app and then use ctrl+a ctrl+d command keys to detach from the session. The session will stay on in the background and your CCF app will stay running!

You can always reattach to the session by entering screen -r in your terminal.

If you don't like the idea of having terminal sessions running in the background, you can also create a .service file to run your application as a background service instead.

What Now?

Now that you have an always-running CCF instance on a cloud-based virtual machine, you can continue to explore both real-time and historical estimates for all of your services in the cloud. If you'd like to explore additional ways to enhance your CCF app, the CCF documentation is a great place to start.

Hopefully you've found this walkthrough helpful and see that this is only the beginning of your cloud carbon footprint journey and of taking steps to help create a greener cloud! For more walkthroughs and technical deep dives, make sure to keep following the CCF Blog and share your experience on our discussions page!


Cloud Carbon Footprint is an open-source project, sponsored by Thoughtworks Inc. under the Apache License, Version 2.0
