Making Sense of Your Data With Python
About
Communication is a massive part of any good security program. Being able to demonstrate successes, failures, and areas of improvement for organizational security posture is critical to receiving buy-in and resources.
One of the best methods for demonstrating security posture is to provide metrics from existing security controls, but this is often easier said than done. Many security tools flood us with data, but don’t provide the means to interpret that data effectively.
This kind of work is often done through Excel macros and pivot tables. That approach is perfectly fine on a small scale, but it is difficult to automate and does not scale well. Python can be a great alternative for security analysts who want to better understand what their tools are telling them, once the learning curve has been surmounted.
What I Will Cover
- Pandas Library: one of the premier Python libraries for analyzing and curating data.
- Python Dataclasses: a standard library module available in Python 3.7 and later, introduced in PEP 557.
Intro to Pandas
Pandas is a great tool for programmatically handling spreadsheets with Python. Pandas can easily convert formats like CSV, Excel, and Python dictionaries to Pandas DataFrames. A Pandas DataFrame is a data structure that organizes data into rows and columns (each column is a Series). This provides the flexibility to easily manipulate formatted data for analysis.
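As a minimal sketch of the dictionary case, the snippet below builds a DataFrame from a plain Python dictionary. The `host` and `alerts` fields are made up for illustration and do not come from any real tool.

```python
import pandas

# Hypothetical alert counts per host; the field names are illustrative only.
records = {
    "host": ["web01", "web02", "db01"],
    "alerts": [4, 0, 7],
}

# Each dictionary key becomes a column (a Series); each list position becomes a row.
frame = pandas.DataFrame(records)

print(frame.shape)            # (number of rows, number of columns)
print(list(frame.columns))    # column names, in insertion order
print(frame["alerts"].sum())  # total alerts across all hosts
```

Once the data is in a DataFrame, it does not matter whether it started life as a dictionary, a CSV, or an Excel sheet; the same column operations apply.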
Working with DataFrames
This section assumes that you have some knowledge of Python 3.x with topics like lists, dictionaries, and conditional statements.
Install Pandas
Use `pip install pandas` (or `pip3 install pandas` if pip for Python 2.x is your system default).
Create your file and import Pandas
Create a file and name it something like `analytics.py` or `csv-parsing.py`. Import Pandas by adding this line of code to the top of your file:
import pandas
Create or upload your CSV File
intColumn,strColumn,dateColumn
1,someData,2-28-2021
3,someMoreData,3-13-2021
3,someMoreData,3-14-2021
5,someFinalData,1-18-2021
Load your data into pandas
dataFrame = pandas.read_csv("example.csv")
Your data is now loaded into a DataFrame that we can use for processing. The DataFrame can be visualized with the following table:
intColumn (ints) | strColumn (strings) | dateColumn (strings until parsed as dates) |
---|---|---|
1 | someData | 2-28-2021 |
3 | someMoreData | 3-13-2021 |
3 | someMoreData | 3-14-2021 |
5 | someFinalData | 1-18-2021 |
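One thing worth knowing about the date column: `read_csv` does not convert date-like text into datetime values unless you ask it to. The sketch below assumes the month-day-year layout used in the example CSV and converts the column explicitly with `pandas.to_datetime`. The CSV is inlined through `io.StringIO` only so the snippet runs without a file on disk.

```python
import io
import pandas

# Same contents as the example CSV, inlined so the sketch is self-contained.
csv_text = """intColumn,strColumn,dateColumn
1,someData,2-28-2021
3,someMoreData,3-13-2021
3,someMoreData,3-14-2021
5,someFinalData,1-18-2021
"""

frame = pandas.read_csv(io.StringIO(csv_text))
print(frame.dtypes)  # dateColumn is still text at this point

# Convert the month-day-year strings into real datetime values.
frame["dateColumn"] = pandas.to_datetime(frame["dateColumn"], format="%m-%d-%Y")

# With real datetimes, comparisons and sorting behave chronologically.
earliest = frame["dateColumn"].min()
print(earliest)
```

If your export uses a different date layout, the `format` string is the part to adjust.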
Now that your data is loaded, you can treat rows and columns in your CSV just like you would any other data in Python, and you can mimic the functionality normally reserved for Excel macros without needing to load up Excel.
Here are a few examples of things you may need to do:
Output Unique Values From a Column
import pandas
dataFrame = pandas.read_csv("example.csv")
# Using Pandas: returns an array of the distinct values
vals = dataFrame["intColumn"].unique()
print(vals)
# Using sets, if you want to add more data later without adding duplicates
vals = set(dataFrame["intColumn"].to_list())
print(vals)
Get The Average From a Column
# Using Pandas
avg = dataFrame["intColumn"].mean()
# Without Pandas, the manual way (convert the column to a list once and reuse it)
values = dataFrame["intColumn"].to_list()
avg = sum(values) / len(values)
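Since pivot tables are a large part of what analysts do in Excel, here is a rough sketch of two pivot-style summaries using `value_counts()` and `groupby()`. It reuses the example columns, inlined as a dictionary so it runs on its own; treat it as one possible approach rather than the only way to do this.

```python
import pandas

# Inline copy of the example data so the sketch runs without example.csv.
frame = pandas.DataFrame({
    "intColumn": [1, 3, 3, 5],
    "strColumn": ["someData", "someMoreData", "someMoreData", "someFinalData"],
})

# Count rows per label, similar to a pivot table with a "count" aggregation.
counts = frame["strColumn"].value_counts()
print(counts)

# Average intColumn per label, similar to a pivot table with an "average" aggregation.
averages = frame.groupby("strColumn")["intColumn"].mean()
print(averages)
```

Both results are Series indexed by the label, so a single value can be looked up with something like `counts["someMoreData"]`.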
These are just some examples of how Pandas methods can simplify data analysis tasks. Please consult the docs for more advanced usage: https://pandas.pydata.org/pandas-docs/stable/index.html
Dataclasses
Dataclasses were added in Python 3.7 and are an extremely useful way to model data in an object-oriented fashion, without needing to work as heavily with Python’s OOP functionalities. Essentially, this allows you to more quickly get up-and-running with storing your data in a more easily-readable way.
A good example of a use case for dataclasses that I’ve run into in the wild comes when working with the spotty, inconsistent APIs that some proprietary security tools provide. When different endpoints name the same things differently, and data that is easy to find in the UI of your appliance is instead hidden away in some poorly maintained module, representing your data as classes can help you limit coupling between the code that collects and normalizes the data and the code that processes it.
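To make the coupling point concrete, here is a sketch of one way to normalize two differently named payloads into a single dataclass. The endpoint names and payload keys (`score`, `cve_id`, `cvssScore`, `cveName`) are hypothetical and do not correspond to any real product's API. The `@dataclass` syntax itself is explained in Basic Usage below.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vulnerability:
    cvss: float
    cve: Optional[str] = None

    @classmethod
    def from_scanner(cls, payload: dict) -> "Vulnerability":
        # Hypothetical endpoint A names its fields "score" and "cve_id".
        return cls(cvss=float(payload["score"]), cve=payload.get("cve_id"))

    @classmethod
    def from_inventory(cls, payload: dict) -> "Vulnerability":
        # Hypothetical endpoint B names its fields "cvssScore" and "cveName".
        return cls(cvss=float(payload["cvssScore"]), cve=payload.get("cveName"))


a = Vulnerability.from_scanner({"score": "9.8", "cve_id": "CVE-2021-26084"})
b = Vulnerability.from_inventory({"cvssScore": 9.8, "cveName": "CVE-2021-26084"})

# Downstream code only ever sees Vulnerability objects, never the raw payloads.
print(a)
print(a == b)
```

The naming quirks of each endpoint stay inside the two constructor helpers, so the analysis code does not need to change when an API does.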
Basic Usage
Below is a basic example of defining a Vulnerability dataclass. We can use type hinting and the @dataclass decorator to skip writing a constructor method:
from dataclasses import dataclass
# Notice the lack of a need for a constructor!!
@dataclass
class Vulnerability:
    # The colon notation is type hinting, it's definitely worth checking out if it looks new to you: https://www.python.org/dev/peps/pep-0484/
    cvss: float
    # We can easily set default values as well!
    cve: str = None
vuln = Vulnerability(cvss=7.7, cve="CVE-2021-36958")
print(vuln)
# Outputs: Vulnerability(cvss=7.7, cve='CVE-2021-36958')
Here is that same class without using dataclasses:
class Vulnerability:
    # Notice the need for a constructor, this can become more cumbersome as you need to add more and more attributes to this class
    def __init__(self, cvss: float, cve: str = None) -> None:
        self.cvss = cvss
        self.cve = cve
vuln = Vulnerability(cvss=7.7, cve="CVE-2021-36958")
print(vuln)
# Outputs: <__main__.Vulnerability object at 0x7f6aa801d4c0>
You’ll notice two main differences here:
- Less typing, and no need to repeat yourself when defining constructor parameters
- The output of printing the dataclass is much more useful, rather than just giving us the memory address and object type.
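A third difference, not listed above but handy for analysis work, is that `@dataclass` also generates an equality method by default, so two instances with the same field values compare as equal. The class names below are illustrative; the sketch just contrasts the two behaviors.

```python
from dataclasses import dataclass


@dataclass
class WithDataclass:
    cvss: float
    cve: str


class Plain:
    def __init__(self, cvss: float, cve: str) -> None:
        self.cvss = cvss
        self.cve = cve


# Dataclass instances are compared field by field.
same_dc = WithDataclass(7.8, "CVE-2021-33909") == WithDataclass(7.8, "CVE-2021-33909")

# Plain classes fall back to identity comparison unless __eq__ is written by hand.
same_plain = Plain(7.8, "CVE-2021-33909") == Plain(7.8, "CVE-2021-33909")

print(same_dc, same_plain)
```

This makes it easier to spot duplicate findings when the same vulnerability is reported by more than one source.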
Usage with Pandas
To combine the two things we’ve covered in this post, we are going to create a DataFrame made up of dataclasses converted to dictionaries. A common use case (and one I’ve run into in the wild) is needing to output the results of some analysis run on dataclasses as a CSV. I’ll show that example here:
from dataclasses import dataclass, asdict
import pandas
@dataclass
class Vulnerability:
    cvss: float
    cve: str = None
print_nightmare = Vulnerability(cvss=7.8, cve="CVE-2021-36958")
sequoia = Vulnerability(cvss=7.8, cve="CVE-2021-33909")
confluence = Vulnerability(cvss=9.8, cve="CVE-2021-26084")
# Construct a list of dictionaries using a list comprehension and the asdict() function from dataclasses
cves = [asdict(d) for d in [print_nightmare, sequoia, confluence]]
df = pandas.DataFrame(cves)
df.to_csv("output.csv")
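The reverse direction can be useful too. The sketch below reads CSV rows back into dataclass instances using `to_dict("records")`. It assumes a CSV that contains only the `cvss` and `cve` columns (for example, one written with `index=False`), and it inlines the text through `io.StringIO` so no file is needed.

```python
from dataclasses import dataclass
from typing import Optional
import io

import pandas


@dataclass
class Vulnerability:
    cvss: float
    cve: Optional[str] = None


# Assumed CSV layout: exactly the two columns that match the dataclass fields.
csv_text = "cvss,cve\n7.8,CVE-2021-33909\n9.8,CVE-2021-26084\n"
df = pandas.read_csv(io.StringIO(csv_text))

# to_dict("records") yields one dictionary per row, keyed by column name.
vulns = [Vulnerability(**row) for row in df.to_dict("records")]

# Example analysis step: keep only findings in the CVSS v3 "critical" range (9.0 and up).
critical = [v.cve for v in vulns if v.cvss >= 9.0]
print(critical)
```

Unpacking each row with `**row` only works while the column names match the field names exactly, so any extra column would need to be dropped first.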
Dataclasses are very powerful when representing data as a class. They become cumbersome when attempting to fit a lot of functionality into a single class, but they can be very helpful for pure analysis.
e3fc2fd @ 2021-10-11