Making Sense of Your Data With Python

About

Communication is a massive part of any good security program. Being able to demonstrate successes, failures, and areas of improvement for organizational security posture is critical to receiving buy-in and resources.

One of the best methods for demonstrating security posture is to provide metrics from existing security controls, but this is often easier said than done. Many security tools flood us with data, but don’t provide the means to interpret that data effectively.

This kind of work is often done through Excel macros and pivot tables. This method is perfectly fine on a small scale, but it can be very difficult to automate and does not scale well. Python can be a great alternative for security analysts to better understand what their tools are telling them, once the learning curve has been surmounted.

What I Will Cover

Pandas Library: one of the premier Python libraries for analyzing and curating data.
Python Dataclasses: a standard library module available in Python 3.7 and later, introduced in PEP 557.

Intro to Pandas

Pandas is a great tool for programmatically handling spreadsheets with Python. Pandas can easily convert formats like CSV, Excel, and Python dictionaries to Pandas DataFrames. A Pandas DataFrame is a data structure that organizes data into rows and columns (each column is a Series). This provides the flexibility to easily manipulate formatted data for analysis.
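As a minimal sketch of that conversion, the snippet below builds a DataFrame from a list of Python dictionaries. The records and column names (host, severity) are made up for illustration and do not come from any particular tool.

```python
import pandas

# Hypothetical records, e.g. alerts exported from a security tool
records = [
    {"host": "web01", "severity": 3},
    {"host": "db01", "severity": 5},
]

# Each dictionary becomes a row; the dictionary keys become column names
frame = pandas.DataFrame(records)

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # the column names
```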

Working with DataFrames

This section assumes that you have some knowledge of Python 3.x with topics like lists, dictionaries, and conditional statements.

Install Pandas
Use pip install pandas (or pip3 install pandas if pip for Python 2.x is your system default).
Create your file and import Pandas
Create a file and name it something like analytics.py or csv-parsing.py. Import Pandas by adding this line of code to the top of your file:
```
import pandas
```

Create or upload your CSV File

intColumn,strColumn,dateColumn
1,someData,2-28-2021
3,someMoreData,3-13-2021
3,someMoreData,3-14-2021
5,someFinalData,1-18-2021

Load your data into pandas

dataFrame = pandas.read_csv("example.csv")

Your data is now loaded into a DataFrame that we can use for processing. The DataFrame can be visualized as the following table:

intColumn (ints)	strColumn (strings)	dateColumn (strings unless parsed as dates)
1	someData	2-28-2021
3	someMoreData	3-13-2021
3	someMoreData	3-14-2021
5	someFinalData	1-18-2021

Now that your data is loaded, you can treat rows and columns in your CSV just like you would any other data in Python, and you can mimic the functionality normally reserved for Excel macros without needing to load up Excel.
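For instance, filtering and sorting rows might look something like the sketch below. It inlines the same CSV contents with io.StringIO so it runs on its own, and it assumes the dates are in month-day-year order, which is what the sample data suggests.

```python
import io
import pandas

# Same contents as example.csv, inlined so the snippet is self-contained
csv_text = """intColumn,strColumn,dateColumn
1,someData,2-28-2021
3,someMoreData,3-13-2021
3,someMoreData,3-14-2021
5,someFinalData,1-18-2021
"""

dataFrame = pandas.read_csv(io.StringIO(csv_text))

# read_csv leaves the dates as strings by default, so convert them explicitly
# (assumption: the format is month-day-year)
dataFrame["dateColumn"] = pandas.to_datetime(dataFrame["dateColumn"], format="%m-%d-%Y")

# Boolean indexing: keep only the rows where intColumn is at least 3
filtered = dataFrame[dataFrame["intColumn"] >= 3]

# Sort the remaining rows by date, oldest first
filtered = filtered.sort_values("dateColumn")
print(filtered)
```

This is roughly the equivalent of applying a filter and a sort in a spreadsheet, except that it can be rerun on new exports without any manual steps.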

Here are some examples of things you may need to do:

Output Unique Values From a Column

dataFrame = pandas.read_csv("example.csv")

## Using Pandas
vals = dataFrame["intColumn"].unique()
print(vals)

## Using sets, if you want to add more data later without adding duplicates
vals = set(dataFrame["intColumn"].to_list())
print(vals)
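If you also want to know how often each unique value appears, which is a common need when building metrics, value_counts() is one option. This is a small sketch that recreates intColumn from the example CSV in memory rather than reading the file.

```python
import pandas

# intColumn from the example CSV, recreated in memory for this sketch
dataFrame = pandas.DataFrame({"intColumn": [1, 3, 3, 5]})

# value_counts() returns a Series mapping each unique value to its count
counts = dataFrame["intColumn"].value_counts()
print(counts)
```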

Get The Average From a Column

## Using Pandas
avg = dataFrame["intColumn"].mean()

## Without Pandas, the manual way
values = dataFrame["intColumn"].to_list()
avg = sum(values) / len(values)

These are just some examples of how Pandas methods can simplify data analysis tasks. Please consult the docs for more advanced usage: https://pandas.pydata.org/pandas-docs/stable/index.html
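As one taste of that more advanced usage, here is a rough sketch of a pivot-table style summary using groupby. It recreates two columns of the example data in memory; the choice of "count" and "mean" as the aggregations is just for illustration.

```python
import pandas

# Two columns from the example CSV, recreated in memory for this sketch
dataFrame = pandas.DataFrame({
    "intColumn": [1, 3, 3, 5],
    "strColumn": ["someData", "someMoreData", "someMoreData", "someFinalData"],
})

# Group rows by strColumn, then count and average intColumn within each group
summary = dataFrame.groupby("strColumn")["intColumn"].agg(["count", "mean"])
print(summary)
```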

Dataclasses

Dataclasses were added in Python 3.7 and are an extremely useful way to model data in an object-oriented fashion, without needing to work as heavily with Python’s OOP functionalities. Essentially, this allows you to more quickly get up-and-running with storing your data in a more easily-readable way.

A good example of a use case for dataclasses that I’ve run into in the wild comes when working with the spotty, inconsistent APIs that some proprietary security tools provide. When you’ve got different endpoints naming the same things differently, and data that is normally easy to find in the UI of your appliance instead hidden away in some poorly maintained module, representing your data as classes can help you limit coupling between the code collecting and normalizing the data, and the code processing the data.

Basic Usage

Below is a basic example of defining a Vulnerability dataclass. We can use type hinting and the @dataclass decorator to skip writing a constructor method:

from dataclasses import dataclass

# Notice the lack of a need for a constructor!!
@dataclass
class Vulnerability:
	# The colon notation is type hinting, it's definitely worth checking out if it looks new to you: https://www.python.org/dev/peps/pep-0484/
	cvss: float
	
	# We can easily set default values as well!
	cve: str = None

vuln = Vulnerability(cvss = 7.7, cve = "CVE-2021-36958")
print(vuln)
# Outputs: Vulnerability(cvss=7.7, cve='CVE-2021-36958')

Here is that same class without using dataclasses:

class Vulnerability:
	# Notice the need for a constructor, this can become more cumbersome as you need to add more and more attributes to this class
	def __init__(self, cvss: float, cve: str = None) -> None:
		self.cvss = cvss
		self.cve = cve

vuln = Vulnerability(cvss = 7.7, cve = "CVE-2021-36958")
print(vuln)
# Outputs: <__main__.Vulnerability object at 0x7f6aa801d4c0>

You’ll notice two main differences here:

Less typing, and no need to repeat yourself when defining and assigning constructor parameters.
The output of printing the dataclass is much more useful, showing the field values rather than just the object type and memory address.
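To tie this back to the inconsistent API use case described earlier, here is a minimal sketch of normalizing two differently shaped responses into the same dataclass. The payloads, their field names, and the from_scanner / from_inventory helpers are all hypothetical and only stand in for whatever your tools actually return.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Vulnerability:
    cvss: float
    cve: Optional[str] = None

# Hypothetical payloads: two endpoints naming the same fields differently
scanner_payload = {"cvss_score": "7.8", "cve_id": "CVE-2021-33909"}
inventory_payload = {"severity": 9.8, "cve": "CVE-2021-26084"}

def from_scanner(payload: dict) -> Vulnerability:
    # This (made-up) endpoint returns the score as a string
    return Vulnerability(cvss=float(payload["cvss_score"]), cve=payload.get("cve_id"))

def from_inventory(payload: dict) -> Vulnerability:
    return Vulnerability(cvss=float(payload["severity"]), cve=payload.get("cve"))

vulns = [from_scanner(scanner_payload), from_inventory(inventory_payload)]

# Code past this point only ever deals with Vulnerability objects
highest = max(vulns, key=lambda v: v.cvss)
print(highest)
```

The idea is that only the small helper functions need to know about each endpoint's quirks, so a renamed field means changing one helper rather than the analysis code.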

Usage with Pandas

To combine the two things we’ve covered in this post, we are going to create a DataFrame made up of dataclasses converted to dictionaries. A common use case (and one I’ve run into in the wild) is needing to output the results of some analysis run on dataclasses as a CSV. I’ll show that example here:

from dataclasses import dataclass, asdict
import pandas

@dataclass
class Vulnerability:
	cvss: float
	cve: str = None

print_nightmare = Vulnerability(cvss = 7.8, cve = "CVE-2021-36958")
sequoia = Vulnerability(cvss = 7.8, cve = "CVE-2021-33909")
confluence = Vulnerability(cvss = 9.8, cve = "CVE-2021-26084")

# Construct a list of dictionaries using a list comprehension and the asdict() function from the dataclasses module
cves = [asdict(d) for d in [print_nightmare, sequoia, confluence]]

df = pandas.DataFrame(cves)

df.to_csv("output.csv")
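One thing to be aware of is that to_csv() writes the DataFrame's row index as an extra first column by default. If you don't want that column, passing index=False is one way to drop it. The sketch below also calls to_csv() without a path, which returns the CSV text so you can preview it before writing a file.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import pandas

@dataclass
class Vulnerability:
    cvss: float
    cve: Optional[str] = None

vulns = [
    Vulnerability(cvss=7.8, cve="CVE-2021-33909"),
    Vulnerability(cvss=9.8, cve="CVE-2021-26084"),
]

df = pandas.DataFrame([asdict(v) for v in vulns])

# With no path given, to_csv returns the CSV text instead of writing a file;
# index=False leaves out the row-number column
csv_text = df.to_csv(index=False)
print(csv_text)
```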

Dataclasses are very powerful when representing data as a class. They become cumbersome when attempting to fit a lot of functionality into a single class, but they can be very helpful for pure analysis.