Threat Research with Snowflake & VirusTotal
There is no shortage of threat intelligence. Between numerous vendors and open sources, the amount of data can be overwhelming. The challenge then becomes curating what you already have and tailoring to your needs whether that’s for research, net defense, or both. Lacework Labs uses a number of different sources and tooling for this however we’ve derived the most value with a combination of Snowflake & VirusTotal; VirusTotal being the collection source, and Snowflake for data analysis and processing.
This blog is a how-to guide on collecting and analyzing threat intelligence with VirusTotal and Snowflake. We’ll describe how this applies to the phases of the intelligence life-cycle and include real-world examples showing how Lacework Labs is using Snowflake to extract indicators, and perform analysis tasks such as malware and attack-infrastructure clustering.
Only the key touchpoints of a collection process are covered in this blog. Details regarding deployment and scaling are out of scope.
Figure 1. Overview
Planning, Requirements Direction |
The following are required for duplicating the tasks in this blog. (Collection requirements are listed in the following section.):
- Snowflake datastore with read-write access role
- VirusTotal API access
- Python3 & Snowflake python connector
- Lacework Labs tools and resources
Two Snowflake tables will be needed: “vt_collection_relations” (or your desired table name) for storing the VirusTotal API responses and timestamp data & “vt_clustering” for tracking of AV name clusters. These can be created with the following commands:
create table vt_collection_relations( sha256 varchar(4096), start_time timestamp_ltz(9), end_time timestamp_ltz(9), etl_time timestamp_ltz(9), PROPS variant); create table vt_clustering ( sha256 varchar(4096), cluster varchar(4096), etl_time timestamp_ltz(9), PROPS variant);
Collection |
Requirements will vary depending on use-case, however Lacework Labs currently collects information on malicious ELF binaries and bash malware. Hashes for files matching this criteria can be returned with the following queries:
Example Query | Description |
tag:elf positives:1+ fs:2021-06-01+ |
Return hashes for files tagged as ‘elf’, with one or more AV detections and with a first_seen after June 1st 2021 |
tag:shell positives:1+ fs:2021-06-01+ | Return hashes for files tagged as ‘shell’, with one or more AV detections and with a first_seen after June 1st 2021 ‘tag:elf positives:1+ fs:2021-06-01+’ |
name:*.sh positives:1+ fs:2021-06-01+ | Return hashes for files with a .sh file extension, one or more AV detections and with a first_seen after June 1st 2021 .(Finds possible bash malware not tagged as shell files) |
The next step in the Virus collection is to query the data for files that meet our requirements. One of these data-types is called relationships. VirusTotal provides valuable relational information regarding malware specimens such as IPs and domains seen as either artifacts or in the communication or distribution of the malware. The following are useful relationship types:
Relationship identifier | Description |
itw_urls | Virus Total “In the Wild” URLs |
itw_ips |
IP addresses parsed from the ITW URLs |
contacted_ips |
IPs contacted during behavioral analysis |
contacted_domains |
Domains contacted during behavioral analysis |
embedded_urls |
Statically observable URLs |
Using version 3 of the VirusTotal API, these relationship types can be returned with a single API request. Example implementation in Python:
def query_VT_behaviorv3(sha256): url = 'https://www.virustotal.com/api/v3/files/'+sha256+'?relationships=embedded_urls,itw_urls,itw_ips,contacted_ips,contacted_domains' proxies=None timeout=None response = requests.get(url, headers={'x-apikey': api_key, 'Accept': 'application/json'}, proxies=proxies, timeout=timeout) if response.status_code != 200: print(response) raise return response.content
Processing |
Snowflake can natively ingest JSON data (up to 16MB ) using the Variant universal data type. This reduces the need for parsing and simplifies both the processing and analytics of data returned from the VirusTotal API. The following is an insert example. Ingestion of the JSON data is achieved using the parse_json function.
INSERT into vt_collection_relations select column1 as SHA256,column2 as START_TIME,column3 as END_TIME, column4 as ETL_TIME, parse_json(column5) as PROPS from values ('80689dc14c3e0dda6a1786945679b10228893ea133573e011d4dda6acb6a038a', '2021-06-30 18:05:50', '2021-06-30 18:05:50', '2021-07-01 14:30:07', '{\n "data": "this is example json!"}');Note – If using Python, it is important to write the raw Virus Total API response into Snowflake. If you convert to JSON using Python, then the resulting object will only be valid JSON for Python. For example, python renders null values as ‘None’. This may result in errors during Snowflake ingestion.
Figure 2. VirusTotal JSON – Loaded into Snowflake