Hi there! Welcome to part 5 of the series where I explore my cable modem data with different data analytics platforms and see what happens. In my previous posts I built dashboards for my cable modem data using tools that I was not familiar with. This post is going to be different. The first difference is that I ran out of pumpkin cookies and will have to find a new reward for my milestones. The more significant difference is that in the previous posts I had very little experience with the tools before I tried to build the dashboard. This time I am using Splunk to do the analysis and visualization of the data, and I have been using Splunk for several years. To keep with the “doing something new” theme of this series, I am going to try out the new Dashboards Beta app, and while I am at it I will build more analytics into the dashboard to try to separate normal values from abnormal ones.
Getting data in, with a different format
When I started this project, I initially chose to send the data to Splunk in JSON format as a text event. Text-based event data is typically what I work with in Splunk. In the last few months, I have spent more time with Splunk’s metric indexes. Metric indexes are very fast but require minor sacrifices in flexibility while searching the data. Being the over-achiever that I am, I decided to focus this project on metrics indexes to get more experience with them. The changes to the Python script that sends the data to Splunk were fast and easy, thanks to helpful documentation from Splunk.
The data starts out as a piece of JSON that looks like this:
I have highlighted the values that will be the metric measurements. The fields that are not highlighted are going to be dimensions used to filter the data during searches. The Modulation and LockStatus fields are used at search time as filters; the Index field is a common grouping field that refers to the channel index rather than the data-storage index. The process for sending the data as an event or as a metric using the Splunk HTTP Event Collector is virtually the same.
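For reference, a single-measurement payload sent to the HTTP Event Collector looks roughly like this. The overall shape ("event": "metric" with metric_name and _value inside fields) is Splunk's documented HEC metrics format; the metric name, timestamp, and values here are illustrative stand-ins, not taken from my actual script:

```json
{
  "time": 1585699200,
  "host": "cablemodem",
  "source": "modem_poller",
  "event": "metric",
  "fields": {
    "metric_name": "downstream.SNR",
    "_value": 35.74,
    "Index": "1",
    "Modulation": "QAM256",
    "LockStatus": "Locked"
  }
}
```

Everything in fields other than metric_name and _value becomes a dimension, which is exactly how Modulation, LockStatus, and Index end up available as filters at search time.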
As some people may be aware, I am somewhat opposed to thoroughly reading documentation. I skim and cherry-pick what I need most of the time. Occasionally this leads to learning things the hacky way, the bad way, and the wrong way, before resorting to reading the documentation and solving the problem the easy way. In this case, skimming the documentation hasn’t come back to bite me, at least not yet. With the newly modified script, I backfilled 3 months’ worth of data into a fresh Splunk Docker instance. As with other posts in this series, getting the data in is the easy part. The real fun starts once I start to work with the data.
Getting started with the dashboard
When I start a new Splunk dashboard, I usually create a blank dashboard, head over to the Search tab, build my search, customize all the parts that need to be adjusted for my use case, and save that search as a dashboard panel. With the new Dashboards Beta app, the process is a little different: I cannot save a search to a dashboard. Charts and panels are created within the dashboard’s edit interface. The experience is pretty good. The interface is a little complicated, but with so many options to customize it is surprisingly intuitive. Code views are available wherever they are helpful, for advanced options that are not exposed in the UI.
The first chart on my new dashboard is a single-value metric powered by a simple search that shows the average signal-to-noise ratio on the downstream channels over the last 24 hours.
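The search behind a panel like that is a one-liner along these lines. The index and metric names are my stand-ins, and the 24-hour window comes from the panel's time range picker rather than the search itself:

```
| mstats avg(_value) AS "Avg SNR (dB)"
    WHERE index=cablemodem metric_name="downstream.SNR"
```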
Sizing the panel and positioning it on the dashboard gives me a little trouble, but I can switch into the code view and set the values that I want. Just like that, we are off and running with the first panel on a brand-new dashboard.
After a few hours of messing around with the layout and the options that I can customize, I have made only minimal progress. Some of this is because I cannot make up my mind about what I want to do. Some of this is because the new Dashboards Beta layouts are not very intuitive to me. I like that I can move the trendline to various sides of the single-value panel. I like that I can change the background colors, adjust the font size for the value and the trend indicator, and make other changes to the charts easily. I am not so fond of layout and sizing with the grid layout of the new dashboards. I spent a couple of hours working with it and only have this rather boring dashboard to show for it.
I take a break and eat a salad while I read some documentation. The documentation keeps showing options in the editor that I don’t have: keyboard shortcuts to move things around, the ability to draw shapes, ways to move things with precision. I want these things, I need these things, or I will be doomed to boring dashboards for the rest of eternity – I have a Snickers because I had only a salad for lunch and I am being overly dramatic. I check the version that the docs were written for against the version of the Dashboards Beta app that I installed; they match. I must be missing something. I create a brand new empty dashboard using absolute positioning instead of grid positioning, and a whole new world opens before my eyes. Toolbar buttons appear, drag and drop works, I have keyboard shortcuts to nudge a panel 1 or 10 pixels at a time, it’s all so glorious. A rainbow appears, the clouds part revealing a sunbeam around my desk, while a choir of angel-winged Buttercups sings in harmony. What was in that salad, I wonder.
My earlier struggles can be traced to a decision I made when I created the empty dashboard, without knowing the ramifications. Like running unknown code or copying answers from StackOverflow, this is bad. I know it, I do it sometimes, and I feel appropriately bad about it when it happens. Let’s back up a step and explore how I messed this up, and maybe save someone else some time. Creating a new dashboard has always been easy: a couple of clicks, type a name, pick some permissions, and that is all. With the new Dashboards Beta app the same is true, except there is a new “Layout” radio button with Grid and Absolute options:
That little yellow arrow points to a popup that, had I taken the time to read it, would have told me exactly what I was choosing.
Choose grid layout to quickly create a pleasing dashboard with an easy layout. Charts are the only visualizations available.
Choose absolute layout to access features such as pixel perfect control, shapes, icons, and image uploads.
As lessons in reading documentation go, you might think I have learned this one the hard way enough times that I would stop doing it. You would be wrong, but it is a logical thought. Speed matters, brute-force till it works, I will sort it out in post – that’s how I roll, right up until I hit a production environment. In production I am very paranoid about making any changes, or breathing in the general direction of anything at all. I nuke 3 paragraphs of grid-layout complaints from this post, add yet another story about not reading documentation, and start over.
That’s looking pretty good. Spacing is under control, I have a lot of flexibility on colors, and the panels make some sense. I learned how to do a chain search, which is the new terminology for post-process searches: a base search runs once, and its results feed different panels. Chain searches can only be set up from the code view, but the same was true for post-process searches in classic Splunk dashboards.
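In the dashboard's code view, a chain search hangs off a base search roughly like this. This is a sketch from memory of the beta's JSON format, with made-up data source names and my usual assumed index and metric names, so treat it as the shape of the thing rather than copy-paste source:

```json
"dataSources": {
  "base_downstream": {
    "type": "ds.search",
    "options": {
      "query": "| mstats avg(_value) AS value WHERE index=cablemodem metric_name=downstream.* span=1h BY metric_name"
    }
  },
  "chain_power": {
    "type": "ds.chain",
    "options": {
      "extend": "base_downstream",
      "query": "| search metric_name=\"downstream.PowerLevel\""
    }
  }
}
```

The payoff is that the expensive mstats portion runs once, and each panel's chain only does the cheap filtering on top of it.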
Already we have two questions. The first: the signal-to-noise value is 35.74 dB – is that normal, is it bad, what does it mean? The same goes for the power values for upstream and downstream; while the power levels seem to correlate between upstream and downstream, I wonder if there is deeper value that we can extract. The second question relates to the Unerrored Codewords: very consistent spikes downward every few hours.
What is normal?
This is a question that comes up a lot when working with data. The power levels seem to follow a daily curve, increasing during the day and falling off during the night. The SNR value looks stable throughout the day but does spike occasionally. I can use Splunk to help me gather some information and develop my theories. Establishing what is normal is a job where machine learning is frequently called in. If I were a data scientist, I would be looking for an algorithm to fit the data, and then using that algorithm to make predictions and draw conclusions. Splunk has the Machine Learning Toolkit, ITSI, and other features to help me figure it out, but that is for another blog series. We’re going to assume that before machine learning was a thing, people were able to do analysis effectively. I am going to compare values over time using some basic math and some SPL.
The strategy I am going to use depends on some things being true. First, there must be predictable fluctuation based on hour of day and day of week. Users start logging on in the morning, and activity tapers off in the late afternoon – except on Saturday and Sunday, where the pattern still exists but with a lower magnitude. Comparing user activity on a Monday morning to a Saturday morning does not usually yield helpful results; comparing a Monday morning to previous Monday mornings gets us good information.
I think that slicing data by time is harder to explain than it is to do. The search can be complex if you see the completed search first, but it is simple when you are building it up line by line. To get started I am looking for a timestamp rounded to the hour, and a measurement value for that hour. In my case, this is the average of the downstream power level values. With that foundation block of data, I am going to add new fields for the day of week, and hour of day to the hourly stat values.
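Assuming the same kind of hypothetical index and metric names I have been using, that foundation plus the two tagging fields is only a few lines of SPL:

```
| mstats avg(_value) AS power
    WHERE index=cablemodem metric_name="downstream.PowerLevel" span=1h
| eval dow=strftime(_time, "%A"), hod=strftime(_time, "%H")
```

The span=1h on mstats does the rounding to the hour, and strftime turns each hourly timestamp into a day-of-week name and a two-digit hour.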
Now that the data has been tagged with the day of week and hour of day, a stats command groups the data by those two fields, computing the average, standard deviation, and 90th percentile for the values in each group. All of the magic happens in this stats command: by grouping by day of week and hour of day, we have the current value and the values that we can compare it to. I also add a count so that I can see how many weeks of data are included.
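The grouping step looks something like the sketch below; the output field names are mine, and latest() is one way to carry the most recent hourly value through the grouping as the "current" value:

```
| stats count AS weeks, latest(power) AS current,
        avg(power) AS avg_power, stdev(power) AS stdev_power,
        perc90(power) AS p90_power
    BY dow, hod
```

Because each group holds one value per week, the count doubles as a rough "how many weeks of history am I comparing against" sanity check.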
Next, to filter down to only the data that we need, I add the current day of week and the current hour into the data set, and keep only the records where the day of week matches the current day of week.
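That filter can be expressed by tagging each row with the current day of week, computed from now(), and comparing it to the row's own tag:

```
| eval now_dow=strftime(now(), "%A")
| where dow=now_dow
```

Keeping all 24 hours of the matching day, rather than filtering down to the current hour too, is what makes the full-day comparison chart possible.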
In just a few steps I have a search that gives me the current power level, and the average for that value over the last 13-14 weeks. People are generally terrible at interpreting numbers, which is why we have charts. Looking at the chart below, I can see the current values against the historical average, and I see that the downstream power level is running a little under the average.
The beauty of this pattern is that it is very flexible once you understand how it works. The hard part is getting the data into that initial shape with hourly metric values. Sometimes you must do inline parsing and transformations before you can do the slicing which can make the search more complex.
Another common challenge is performance. In my use case, I am looking at 16 weeks of data, which is about half a million values, and the search takes less than 1 second. If this were a large data source with millions of events per hour, pulling 16 weeks of data would be very slow and might not even work correctly due to memory limits. When I have run into this challenge in the past, I have used summary indexing to get around it. With summary indexing, a scheduled search runs periodically to squash the data down to hourly results, storing only those hourly results in a summary index. The dashboard search then combines the live data with the summary-indexed data. The method for the join varies from use case to use case, but it is possible. The only tricky part is making sure that the summary index search is resilient and that backfilling data is handled correctly.
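A minimal sketch of the scheduled half of that pattern, with my assumed names and a hypothetical summary index called modem_summary; the schedule and the search's time range need to be set so that each run covers complete hours and overlaps enough to pick up late-arriving data:

```
| mstats avg(_value) AS power
    WHERE index=cablemodem metric_name="downstream.PowerLevel" span=1h
| collect index=modem_summary
```

The dashboard search then reads the cheap hourly rows from modem_summary for the historical portion and only hits the raw metrics for the most recent window.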
Adding my historical comparison charts to the dashboard is easy, all of the work is handled with the search itself. The dashboard gives me a good snapshot of what is going on, and the color coding helps separate different metric groups.
There are still a few questions left to be answered. I have not addressed the Unerrored Codewords spikes, but I am pretty sure that is an artifact of the modem hitting the upper boundary of the variable type that it is storing that value in. Looking at the last values before the spike should tell me what that is about.
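One way to eyeball that, again with my assumed names: pull the hourly maximums, line each value up with its predecessor per channel, and keep the rows where the counter dropped. If prev_cw keeps landing near 4,294,967,295 (the unsigned 32-bit ceiling), a counter rollover is the likely story:

```
| mstats max(_value) AS cw
    WHERE index=cablemodem metric_name="downstream.UnerroredCodewords" span=1h BY Index
| streamstats current=f window=1 last(cw) AS prev_cw BY Index
| where cw < prev_cw
| table _time, Index, prev_cw, cw
```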
I have figured out that the power levels are a little below average, information that I didn’t have before. I don’t yet know why the values are lower than average. I think the pattern I am seeing with the power levels is related to the weather. I have a script collecting hourly weather data for my area that has been running for a few months. That script was built to get data for machine learning and AI experiments, but that data might have some use here as well. Often it is necessary to combine data from different sources and different purposes to get more insights. There will always be more questions, and more data to play with to answer those questions.
With my dashboard created, I have reached the end of this post. The new Dashboards Beta app was fun to work with once I got used to it. I was able to get good information out of raw data and learn some new things along the way. I’ll be back in a few weeks with the last post in this series, wrapping up the project and figuring out what I learned over the past several weeks.
This blog was written by Greg Porterfield, Senior Security Consultant at Set Solutions.