IS THERE ENOUGH FIRE IN YOUR FIREWALL?

USE PYTHON, GEPHI, D3.JS, AND DIMPLE.JS TO EXPLORE CYBER VULNERABILITIES

by NATIONAL VULNERABILITIES DATABASE @ THE NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
edited by STAR YING, JEFF CHEN, AND TYRONE GRANDISON, COMMERCE DATA SERVICE
As part of the Commerce Data Usability Project, NIST, in collaboration with the Commerce Data Service, has created a tutorial that uses the National Vulnerabilities Database to explore cyber vulnerabilities. If you have questions, feel free to reach out to the Commerce Data Service at [email protected].


How often should we update our software to stay secure?

That's the question Americans ask themselves daily. What is a good tradeoff between convenience and security when everything we use seems to require constant updates? How often are the vulnerabilities being discovered, exploited, or patched in the software that powers our lives? How much are we putting ourselves at risk when we ignore our updates?

As it turns out, the National Institute of Standards and Technology (NIST) maintains the National Vulnerabilities Database (NVD), which contains reported vulnerabilities catalogued by their Common Vulnerabilities and Exposures (CVE) Identifier. With coverage back to 2002, the NVD is sponsored by the Department of Homeland Security's National Cyber Security Division. Analysis of the listed vulnerabilities reveals insights into the frequency at which vulnerabilities are reported and the risk they pose to our everyday lives.

Seeing is believing

Over time, reported vulnerabilities appear to fluctuate at random. For example, the graph below of 2015 activity plots the number of reported vulnerabilities per day as well as the 'weighted threat' (the sum of the Common Vulnerability Scoring System scores). The Common Vulnerability Scoring System (CVSS) is "an open framework for communicating the characteristics and severity of software vulnerabilities" published by the CVSS Special Interest Group, acting on behalf of FIRST.org. We can use the CVSS score as a measure of the threat posed by any one vulnerability. NIST also maintains a calculator for CVSS v2 and will begin scoring with CVSS v3 in the fall of 2015. Every vulnerability in the NVD is scored according to the CVSS. We call the sum of the CVSS scores for a given day the 'weighted threat' for that day. Looking at the timeline below, we can see severe differences from day to day. For example, on July 16, 2015, there were 159 vulnerabilities reported with a combined weighted threat of roughly 941. Other days barely see two vulnerabilities reported, with weighted threats mostly under 50. That is quite a large margin from day to day.
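To make the metric concrete, here is a minimal sketch of how one day's count and weighted threat would be computed; the CVSS scores below are hypothetical stand-ins, not values pulled from the NVD.

# Hypothetical CVSS scores for vulnerabilities reported on a single day.
daily_scores = [7.5, 4.3, 10.0, 5.0, 6.8]

daily_count = len(daily_scores)       # unweighted count of vulnerabilities
weighted_threat = sum(daily_scores)   # 'weighted threat' = sum of the CVSS scores

print 'Count: %d, weighted threat: %.1f' % (daily_count, weighted_threat)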

Vulnerabilities reported in 2015

Is there a seasonal effect to the reporting of vulnerabilities?

Often times, patterns only emerge when aggregating data. Given such large differences day to day, what about changes from week to week? How much can the number of vulnerabilities reported vary from week to week? We now compare 2015 against the last four years on a weekly basis. Looking below, the number of reported vulnerabilities still varies greatly, but we do see consistent peak weeks across several years. Certain weeks peak or dip consistently across years, roughly on a quarterly cadence.

On an initial look, there appears to be a slight seasonal effect every quarter. This would suggest that, at a bare minimum, we should update our software every quarter before the next spike in reported vulnerabilities comes in.

annual time series for the past five years

What platforms are being reported on?

Given the sometimes large threat from reported vulnerabilities in a day, what platforms are being targeted? Since most days have zero reported vulnerabilities, we look at a weekly snapshot and plot a network graph of affected platforms. Sizing each vulnerability by its CVSS score, we can see which platforms are the biggest targets from our timeline above.

Looking at the network graph, we can see that vulnerabilities are often very platform specific, and on any given week we can see numerous platforms with new reported vulnerabilities. For example, looking at the spike in Week 4 (week of January 19th) below, we can see that Oracle had a very bad Wednesday: the lion's share of reported vulnerabilities that week were targeted at Oracle. Likewise, the spike in Week 16 (week of April 13th) above corresponds to a busy Tuesday for Microsoft and Linux and a busy Thursday for Oracle; they received the bulk of vulnerabilities reported that week.

network graph of vulnerabilities and platforms


But what platforms over the year are reported on and patched the most?

Our weekly snapshot reveals that for any given week of the year, five or more platforms could have vulnerabilities reported. But what are the most targeted platforms overall for 2015? To answer this, we look at the top-ranking platforms in terms of vulnerabilities reported. Are the platforms being targeted ones people never use? Does conventional wisdom hold? Does Apple receive fewer reported vulnerabilities than Microsoft? Does Linux receive even fewer?

larger picture

Looking below, widely used platforms obviously have more reported vulnerabilities. The difference in reported vulnerabilities between Apple and Microsoft is minimal over 2015; vulnerabilities for both platforms are being reported and patched with roughly the same consistency. Linux receives far fewer than the other two major operating systems, suggesting it attracts much less focus than Apple or Microsoft. We can also look at the interaction between platforms, which up to this point we have not been able to capture. Adobe, Canonical, and Novell are closely tied with major platforms such as Apple, Microsoft, or Cisco: vulnerabilities reported for the former are often reported for the latter.

Okay, so how often should I update?

So how often should we update? Our exploratory look at the NVD suggests that, at the very least, we should update every quarter. We see that there is a slight seasonality to when reports come in. Of course, our look did not examine the associated threat of each vulnerability in much depth; more severe vulnerabilities would necessitate a higher update frequency. But how frequent? The answer is determined by each person's individual risk tolerance. Providing these tools allows each individual to more accurately assess their own course of action.

Getting Started

In addition to assessing cybersecurity risks, the NVD data can also be used to identify vulnerability patterns across platforms and seasonality. To give you a head start, we will cover the following basic steps for analyzing the NVD and creating data visualizations in Python:

  • Parsing the NIST NVD 2015 XML and extracting relevant fields into a manipulable format;
  • Creating a timeline for vulnerabilities;
  • Creating a network list for the weekly view; and
  • Creating edge and node lists for the annual view.

Follow along in the IPython Notebook below or check out the code files at the GitHub repo (https://github.com/CommerceDataService/tutorial_nist_nvd).
Jupyter Notebook Version

Creating Inputs

Importing the Necessary Packages into Python for processing

We first import the necessary packages to process the NIST National Vulnerabilities Database (NVD) for 2015. The XML file is available from the NIST NVD website. The packages used are:

  1. re for parsing the XML with regular expressions
  2. csv for exporting to comma-separated value (CSV) files
  3. json for exporting to JSON
  4. datetime for determining the week number of the year and the day number of the week from a timestamp
  5. (Optional) tqdm for adding a cheap progress bar to an iterable.
import re, csv, json, datetime
from tqdm import tqdm  # optional: progress bars on long loops

Defining a function to extract necessary information

We first define a function that can parse the NIST NVD XML file and extract the relevant information. We parse out more fields than are needed for our visualizations; the additional information can be used in the future to create more insights into the NIST NVD. The function returns a dictionary keyed by vulnerability identifier as defined by https://cve.mitre.org/. All additional information is then stored in a nested dictionary for that particular vulnerability.

def pull_nvd(fname):
    # Parse an NVD CVE XML 2.0 feed line by line and return a dictionary keyed by CVE ID.
    # The element names matched below follow the NVD CVE XML 2.0 schema.
    print 'Loading NVD Dataset: %s' % fname
    nvd_dict = {}
    with open(fname) as f:
        for ln in tqdm(f):
            if 'entry id=' in ln:
                # New entry: grab the CVE identifier (the sequence part may exceed four digits).
                vuln_id = re.search(r'CVE.\d{4}.\d{4,}', ln).group(0)
                nvd_dict[vuln_id] = {}
            elif 'cpe-lang:fact-ref name=' in ln:
                # Affected platform (CPE reference); initialize the lists only once so an
                # entry with several platform lines accumulates all of them.
                if 'vuln_os' not in nvd_dict[vuln_id]:
                    nvd_dict[vuln_id]['vuln_os'] = []
                    nvd_dict[vuln_id]['vuln_os_plat'] = []
                vuln_os_plat = re.split(r'[:\"]', ln)
                nvd_dict[vuln_id]['vuln_os'].append(re.sub('_', ' ', vuln_os_plat[4]))
                nvd_dict[vuln_id]['vuln_os_plat'].append(re.sub('_', ' ', vuln_os_plat[4]) + ' ' + re.sub('_', ' ', vuln_os_plat[5]))
            elif 'vuln:published-datetime' in ln:
                # Published date: store the ISO calendar tuple (year, week, weekday)
                # and a formatted datetime string.
                vuln_date = re.search(r'\d\d\d\d.*<', ln).group(0)[:-1]
                date = [int(vuln_date[0:4]), int(vuln_date[5:7]), int(vuln_date[8:10]), int(vuln_date[11:13]), int(vuln_date[14:16]), int(vuln_date[17:19])]
                nvd_dict[vuln_id]['vuln_date'] = datetime.date(date[0], date[1], date[2]).isocalendar()
                nvd_dict[vuln_id]['vuln_datetime'] = datetime.datetime(date[0], date[1], date[2], date[3], date[4], date[5]).strftime('%Y%m%d %X')
            elif 'cvss:score' in ln:
                # CVSS base score.
                nvd_dict[vuln_id]['vuln_score'] = re.search(r'\d?\d.\d', ln).group(0)
            # CVSS base metrics.
            elif 'cvss:access-vector' in ln:
                nvd_dict[vuln_id]['vuln_vector'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'cvss:access-complexity' in ln:
                nvd_dict[vuln_id]['vuln_compl'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'cvss:authentication' in ln:
                nvd_dict[vuln_id]['vuln_auth'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'cvss:confidentiality-impact' in ln:
                nvd_dict[vuln_id]['vuln_confid'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'cvss:integrity-impact' in ln:
                nvd_dict[vuln_id]['vuln_integ'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'cvss:availability-impact' in ln:
                nvd_dict[vuln_id]['vuln_avail'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'vuln:reference href=' in ln:
                # Keep the (last) reference link seen for the entry.
                nvd_dict[vuln_id]['vuln_link'] = re.search(r'href=.*\"', ln).group(0)[6:-1]
            elif 'vuln:summary' in ln:
                nvd_dict[vuln_id]['vuln_summ'] = re.search(r'>.*<', ln).group(0)[1:-1]
            elif 'cvss:source' in ln:
                # Source that supplied the CVSS scoring.
                nvd_dict[vuln_id]['vuln_source'] = re.search(r'>.*<', ln).group(0)[1:-1]
    return nvd_dict

Run the function we defined

Now we run the function we just defined on the NVD 2015 XML file, which has been extracted to the working directory.

nvd = pull_nvd('./nvdcve-2.0-2015.xml')
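To verify the parse, it can help to peek at a single entry; the exact keys present will vary by vulnerability (for example, not every CVE carries a CVSS score), and the identifier chosen here is simply whichever sorts first.

sample_id = sorted(nvd.keys())[0]
print sample_id
print nvd[sample_id]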

Create a timeline for 2015 so far

We create a timeline for 2015 so far, tracking both the unweighted count of instances and the totals weighted by the CVSS score provided. Both the number of vulnerabilities reported and the threat they pose on a given day are of interest. We compute a running sum for each to see how much of a difference there is from day to day.

# nvd_timeline[0] holds the weighted series and nvd_timeline[1] the unweighted counts;
# the two lists stay index-aligned, so the same date sits at the same position in both.
nvd_timeline = [[],[]]
for cve in tqdm(nvd.keys()):
    if 'vuln_score' in nvd[cve].keys():
        weightedme = {'Method':'Weighted', 'Date':nvd[cve]['vuln_datetime'], 'Value':float(nvd[cve]['vuln_score'])}
        countme = {'Method':'Count', 'Date':nvd[cve]['vuln_datetime'], 'Value':1}
        if nvd[cve]['vuln_datetime'] not in [ x['Date'] for x in nvd_timeline[0] ]:
            # First vulnerability seen at this timestamp: start both series.
            nvd_timeline[0].append(weightedme)
            nvd_timeline[1].append(countme)
        else:
            # Otherwise add to the running sums at the matching position in both series.
            for i, day in enumerate(nvd_timeline[0]):
                if nvd[cve]['vuln_datetime'] == day['Date']:
                    nvd_timeline[0][i]['Value'] += float(nvd[cve]['vuln_score'])
                    nvd_timeline[1][i]['Value'] += 1

Export our newly structured data to the local working directory as a JSON. This is now an input for our d3.js + dimple.js scatter plot.

with open('./nvd_timeline.json', 'wb') as f:
    json.dump(nvd_timeline, f)
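As a quick sanity check, you can peek at the first record of each series; this is the structure the scatter plot expects, though the actual dates and values will of course differ.

print nvd_timeline[0][:1]   # weighted threat series
print nvd_timeline[1][:1]   # unweighted count series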

Create a count of vulnerabilities per platform

Now we are interested in the number of vulnerabilities reported by platform. Does conventional wisdom hold up? Does Apple receive fewer reported vulnerabilities than Microsoft? Does Linux receive even fewer? To answer this, we create a new entry for each new platform and keep a running sum as we parse through our defined dictionary. We then sort the resulting list and add a header to make loading in JavaScript faster.

nvd_plat = []
for cve in tqdm(nvd.keys()):
    if 'vuln_score' in nvd[cve].keys():
        for os in nvd[cve]['vuln_os']:
            appendme = [os, 1]
            # First time we see this platform, add a new row; otherwise bump its running count.
            if os not in [ x[0] for x in nvd_plat ]:
                nvd_plat.append(appendme)
            else:
                for i, plat in enumerate(nvd_plat):
                    if os == plat[0]:
                        nvd_plat[i][1] += 1

nvd_plat_sort = [['plat','count']]
nvd_plat_sort.extend(sorted(nvd_plat))
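The sort above orders platforms alphabetically, which keeps the CSV stable across runs. If you instead want a quick, exploratory look at the top targets ranked by count, a one-line variation does it (this is not an input to the visualization):

top_plat = sorted(nvd_plat, key=lambda x: x[1], reverse=True)[:10]
print top_plat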

Export our (again) newly structured data, this time to a CSV in the local working directory; this also shows how different file types can be exported from Python. We can then use the exported file as an input for the d3.js + dimple.js histogram.

with open('./nvd_plat.csv', 'wb') as csvfile:
    writeme = csv.writer(csvfile, delimiter=',')
    writeme.writerows(nvd_plat_sort)

Create network graph with weekly information

Now we want to look at which platforms are being attacked each week. We do this by creating a dictionary of nodes and edges that also carries the week number, day number, and threat score. Given that dictionary, we then export it to a JSON file.

nvd_json = {'nodes':[],'links':[]}
for cve in nvd.keys():
    if 'vuln_os' in nvd[cve].keys():
        # Not every entry carries a CVSS score; default those to 0.0.
        threat = float(nvd[cve].get('vuln_score', 0.0))
        # One node per vulnerability (type 1), tagged with its ISO week and weekday.
        append_vuln = {'name':cve, 'group':nvd[cve]['vuln_date'][2], 'week':nvd[cve]['vuln_date'][1], 'threat':threat, 'type':1}
        nvd_json['nodes'].append(append_vuln)
        for os in nvd[cve]['vuln_os']:
            # One node per platform (type 0), added only once per week/day combination.
            append_plat = {'name':os, 'group':nvd[cve]['vuln_date'][2], 'week':nvd[cve]['vuln_date'][1], 'threat':0, 'type':0}
            if append_plat not in nvd_json['nodes']:
                nvd_json['nodes'].append(append_plat)
            # Edge from the vulnerability to each platform it affects.
            nvd_json['links'].append({'source':cve, 'target':os, 'group':nvd[cve]['vuln_date'][2], 'week':nvd[cve]['vuln_date'][1], 'threat':threat})

We now export to JSON for our d3.js network graph.

with open('./data/nvd.json', 'wb') as f:
    json.dump(nvd_json, f)
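To mimic the week selector in the d3.js view, or simply to sanity-check the structure, you can slice the dictionary down to a single ISO week before plotting; week 16 here is just an example value.

week_of_interest = 16  # e.g. the week of April 13th discussed above
week_nodes = [n for n in nvd_json['nodes'] if n['week'] == week_of_interest]
week_links = [l for l in nvd_json['links'] if l['week'] == week_of_interest]
print 'Week %d: %d nodes, %d links' % (week_of_interest, len(week_nodes), len(week_links))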

Build inputs for Gephi (Big Annual Network Graph)

Weekly slices are important, but we also want to see the whole network of vulnerabilities and platforms for 2015 at once. The number of nodes (platforms and vulnerabilities) and edges is too large for an interactive d3.js visualization, so we use Gephi instead. Gephi requires an edge list and, optionally, a node list (needed if unconnected nodes exist). We create both in this exercise as lists of rows with the headers filled in, then populate each row with either an edge or a node. Finally, we write both to CSV files in the working directory.

nvd_edge = [['source','target','value']]
for cve in nvd.keys():
    if 'vuln_score' in nvd[cve].keys():
        threat = nvd[cve]['vuln_score']
    else:
        threat = 0.0
    if 'vuln_os' in nvd[cve].keys():
        # Group vulnerabilities by their CVSS characteristics rather than by CVE ID
        # to keep the annual graph at a manageable size.
        cve_type = (nvd[cve]['vuln_vector'], nvd[cve]['vuln_compl'], nvd[cve]['vuln_auth'], nvd[cve]['vuln_confid'], nvd[cve]['vuln_integ'])
        for plat in nvd[cve]['vuln_os']:
            appendme = [cve_type, plat, threat]
            if appendme not in nvd_edge:
                nvd_edge.append(appendme)

# Build the node list from the edge list: one node per vulnerability type and one per platform.
nvd_node = [['id','value','type']]
for row in nvd_edge[1:]:  # skip the header row
    append_vuln = [row[0], row[2], 0]
    append_targ = [row[1], 0.0, 1]
    if append_vuln not in nvd_node:
        nvd_node.append(append_vuln)
    if append_targ not in nvd_node:
        nvd_node.append(append_targ)

We now export them to CSV for Gephi.

with open('./nvd_edge.csv', 'wb') as csvfile:
    writeme = csv.writer(csvfile, delimiter=',')
    writeme.writerows(nvd_edge)

with open('./nvd_node.csv', 'wb') as csvfile:
    writeme = csv.writer(csvfile, delimiter=',')
    writeme.writerows(nvd_node)

Looking at other years of the NVD

Given our look at the NVD in 2015, we want to expand to past years. We do this by extending our original dictionary beyond 2015: we load previous years of the NVD with the same function and update the dictionary we defined. Then, for each week, we count how many vulnerabilities were reported in each year.

nvd.update(pull_nvd('./nvdcve-2.0-2011.xml'))
nvd.update(pull_nvd('./nvdcve-2.0-2012.xml'))
nvd.update(pull_nvd('./nvdcve-2.0-2013.xml'))
nvd.update(pull_nvd('./nvdcve-2.0-2014.xml'))
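A quick check that the merge worked; the exact count depends on which feeds you downloaded.

print 'Total CVEs loaded: %d' % len(nvd)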

Now, with our newly beefed-up dictionary that includes the 2011 to 2015 datasets from the NVD, we create a time series for all five years. To do this, we look at the ISO week number for any given year and keep a running sum of the number of vulnerabilities reported. Commented out is an alternative that sums the combined CVSS score for each week instead.

nvd_timeline_five = []
for cve in tqdm(nvd.keys()):
    if 'vuln_score' in nvd[cve].keys():
        # One row per ISO week: [week, 2015, 2014, 2013, 2012, 2011]
        appendme = [nvd[cve]['vuln_date'][1],0,0,0,0,0]
        if nvd[cve]['vuln_date'][0] == 2015:
            colnum = 1
        elif nvd[cve]['vuln_date'][0] == 2014:
            colnum = 2
        elif nvd[cve]['vuln_date'][0] == 2013:
            colnum = 3
        elif nvd[cve]['vuln_date'][0] == 2012:
            colnum = 4
        elif nvd[cve]['vuln_date'][0] == 2011:
            colnum = 5
        else:
            # ISO years at the calendar boundaries can fall outside 2011-2015; skip them.
            continue
        if nvd[cve]['vuln_date'][1] not in [ x[0] for x in nvd_timeline_five ]:
            # appendme[colnum] += float(nvd[cve]['vuln_score'])
            appendme[colnum] += 1
            nvd_timeline_five.append(appendme)
        else:
            for i, day in enumerate(nvd_timeline_five):
                if nvd[cve]['vuln_date'][1] == day[0]:
                    # nvd_timeline_five[i][colnum] += float(nvd[cve]['vuln_score'])
                    nvd_timeline_five[i][colnum] += 1

As before, we sort the resulting structured data and attach a header to it. This reduces processing on the JavaScript end.

nvd_timeline_five_sort = [['week','2015','2014','2013','2012','2011']]
nvd_timeline_five_sort.extend(sorted(nvd_timeline_five))

Now we can export it to a CSV for our dygraph.js plot.

with open('./data/nvd_timeline_five.csv', 'wb') as csvfile:
    writeme = csv.writer(csvfile, delimiter=',')
    writeme.writerows(nvd_timeline_five_sort)