Mobile Image Timeline Analysis with Splunk

Nearly every type of digital forensic analysis should include some form of timeline analysis for correlating data or discovering new leads. In my day job, I perform this by looking for trends or patterns in network activity, or by determining what additional actions occurred near the point in time of an identified incident.

Splunk has become one of my favorite tools for this analysis. This is not because it does the work for me; Splunk is a very powerful application for aggregating large amounts of data and then parsing and querying that data. However, anyone who queries data with Splunk must actually understand the data being ingested to use the tool effectively for analysis.

One reason I’m not always fond of “powerful, paid-for, commercial tools” is that they don’t always require the analyst to thoroughly understand the data being ingested, which can lead to misunderstanding the processed results or drawing false conclusions. Another, more obvious argument against big “off-the-shelf” applications or appliances is the monetary cost, both to the organization and to the studying analyst.

Thus, I’ve decided to try my hand at incorporating a versatile tool I’ve come to appreciate, the free version of Splunk (with its 500MB daily indexing limit), into a subject I am currently studying, Mobile Device Forensics, a field that, much like my own profession, is full of shiny, expensive tools. One such application that I have had the opportunity to learn about and have been impressed with is the Examine element of the Magnet AXIOM suite. Examine can process an operating system (OS) image from a mobile device and create a timeline of events, including file creation, file access, file modification, and more. While this provides a powerful capability and satisfies the need for timeline analysis, a small business or organization may not have the resources or manpower to utilize the Magnet AXIOM suite.

Here, as a free and “lighter-weight” alternative, I will attempt to extract all EXIF data from an OS image, move it into a Splunk index, and perform some basic analysis of that data.

Image and Tools

This walkthrough assumes that the OS image has already been acquired. In this example, I will use the Android 10 image (file system extraction) created and provided by Joshua Hickman, and I will specifically be using the Non-Cellebrite Extraction.

I used the following tools:

  • PowerShell Version 5.1.17763.134
  • ExifTool Version 10.11
  • Splunk Enterprise Version 8.0.3

Acquire Raw EXIF Data

To start my experiment, I’ll need to acquire the EXIF data from the image. ExifTool is a great command-line utility that can collect this information, but I was unable to get its built-in recursive functionality to process the entire image, so I recruited my friend Mike to hold my hand in working out a PowerShell script that accomplishes the same thing:

# Source directory of the extracted image
$imagepath = 'G:\Pixel 3 Image\'

# Recursively enumerate everything in the image, including hidden items
$path = (Get-ChildItem -Path $imagepath -Recurse -Force -ErrorAction SilentlyContinue).FullName

# Run exiftool against each item, writing JSON-formatted EXIF data
# to one .txt file per source file in G:\Pixel3RawEXIF\
foreach ($i in $path)
{& '.\exiftool.exe' -j -w G:\Pixel3RawEXIF\%f_%e.txt $i}

A few notes about this script:

  • Avoid problems by running this script from an elevated PowerShell prompt.
  • $imagepath identifies the source of the image.
  • “G:\Pixel3RawEXIF\” identifies the directory the EXIF data will be exported to.
  • The script needs to be run from the same directory as “exiftool.exe”.
  • Ensure that the “exiftool.exe” filename does not contain “-k” (the Windows download is named “exiftool(-k).exe” by default, which causes the tool to pause after each file).
  • The “-j” and “-w” options are ExifTool options specifying JSON-formatted output and writing output to file, respectively. The “%f_%e.txt” parameters to “-w” retain the original file name and extension and append “.txt” to name the created file (which contains the EXIF data for that specific file).
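Once the script completes, a quick sanity check can compare how many files were enumerated against how many EXIF output files were produced. Below is a minimal sketch assuming the same paths as above; note that the counts will rarely match exactly, since the “%f_%e” naming scheme collides for files sharing a name and extension, and “-w” skips output files that already exist:

# Compare the number of source files to the number of EXIF output files
$srcCount = (Get-ChildItem -Path $imagepath -Recurse -Force -File -ErrorAction SilentlyContinue).Count
$outCount = (Get-ChildItem -Path 'G:\Pixel3RawEXIF\' -File).Count
Write-Host "Source files: $srcCount; EXIF outputs: $outCount"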

The runtime of this script will vary considerably depending on how many files are in the image, the type of hard drive the image resides on, and the computing resources available on the executing machine; as a general rule, this step will likely take many hours to complete. The Windows 10 virtual machine I used for this experiment had 16GB of RAM and two processors @ 2.20GHz, and took approximately 11.5 hours to process the ~13GB image.

Optionally, the script can be wrapped in the Measure-Command PowerShell cmdlet, which reports statistics about execution runtime upon completion of the script.

Adding the cmdlet to my script as follows:

Measure-Command {

    # Same script as above, wrapped to capture total execution time
    $imagepath = 'G:\Pixel 3 Image\'
    $path = (Get-ChildItem -Path $imagepath -Recurse -Force -ErrorAction SilentlyContinue).FullName
    foreach ($i in $path)
    {& '.\exiftool.exe' -j -w G:\Pixel3RawEXIF\%f_%e.txt $i}

}

Yielded the following output when the script finished running:

Script execution time

Install and Configure Splunk

Now that I’ve acquired the raw data, I need to build a Splunk instance to process it. Splunk can be installed on Windows, Linux, or macOS. As an administrator, I prefer managing Splunk on Ubuntu, but for the sake of this experiment, where I’m only concerned with analysis, I’ll install Splunk on Windows.

I downloaded Splunk Enterprise 8.0.3 from the Splunk download page (Login is required; accounts are free).

Once the MSI installer was downloaded, I ran through the installation.

From an elevated PowerShell prompt, start Splunk from the $SPLUNK_HOME\bin\ directory with the “.\splunk.exe start” command:

Start Splunk

The first time you start Splunk, you will have to agree to the license agreement and create the administrator credentials.

Once the instance is started, it will be accessible from a web browser at the default port of 8000:

http://localhost:8000

Splunk login page at default port 8000

Define Timestamp Source

Next, I need to feed the data into Splunk. Splunk can process many sources and formats of data; in this experiment, I will be processing JSON. Generally, JSON data can be fed into Splunk with no further configuration. In the case of my data, however, there is one issue I’ll need to address.

By default, Splunk attempts to determine timestamps from the data itself (this determination can get complex, and more information about the functionality can be found in the Splunk documentation); if it is unable to (which was the case here), it falls back to the timestamps associated with the source file. My source files have timestamps that refer to the time they were created by my PowerShell script, but I want Splunk to use the timestamps contained within the data, which refer to the original files on the OS image.

There are several ways to accomplish this. In this particular instance, I will modify the Splunk configuration to specify how timestamps should be determined for JSON-formatted inputs. The configuration file to change is props.conf. This file does not exist by default in the directory I need to modify it in, so I created it in the “$SPLUNK_HOME\etc\system\local\” directory. Then, I defined the following lines within the file:

[_json]
INDEXED_EXTRACTIONS = JSON
KV_MODE = none
AUTO_KV_JSON = false
TIMESTAMP_FIELDS = FileModifyDate
TIME_FORMAT = %Y:%m:%d %H:%M:%S

The main field here that could vary based on what I want to analyze is the value of TIMESTAMP_FIELDS. I have defined it as “FileModifyDate”, a JSON field within my source files whose value is the timestamp of the original file within the OS image.

Based on this props.conf configuration, when Splunk receives an input in JSON format, it will read the data contained within the file to locate a JSON field titled “FileModifyDate”, record the value of that field (our original timestamp), parse it according to the TIME_FORMAT defined in props.conf, and use this final timestamp for the event (the source file) in its own index.
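To make that concrete, here is an illustrative record (not taken from the actual image) of the kind ExifTool’s “-j” option emits; the TIME_FORMAT string “%Y:%m:%d %H:%M:%S” maps onto the leading portion of the FileModifyDate value:

[{
  "SourceFile": "G:/Pixel 3 Image/example.jpg",
  "FileName": "example.jpg",
  "FileModifyDate": "2020:02:14 02:31:07-05:00"
}]

If the data consistently carries a trailing timezone offset like the “-05:00” above, appending %z to TIME_FORMAT would allow Splunk to honor that offset as well.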

I restarted the Splunk process so that the new JSON processing instructions would take effect:

Syntax to restart Splunk
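For reference, from the same $SPLUNK_HOME\bin\ directory, the restart is a single command:

.\splunk.exe restart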

Create Index to Store Data

An index in Splunk is an isolated repository of data. I will create an index to store the data inputs:

  • Go to “Settings” > “Indexes”
    • Click “New Index”
    • Define the Index Name “pixel3exif”
    • Click “Save”
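Alternatively, the same index can be created from the CLI in the $SPLUNK_HOME\bin\ directory:

.\splunk.exe add index pixel3exif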

Define Data Input

The data inputs refer to the actual data that fills an index, such as the EXIF data my script created. I will define the data inputs (source data) to be fed into my new index “pixel3exif”:

  • Go to “Settings” > “Data Inputs”
    • Click “Files & Directories”
    • Click “New Local File & Directory”
    • Click “Browse”
      • Select directory
    • Click “Next”
    • Under Source type, click “Select”
      • Select Source Type
      • “Structured”
        • “_json”
    • Under App context, select “Search & Reporting (search)”
    • Under Index, select “pixel3exif”
    • Click “Review”
    • Click “Submit”
    • Click “Start Searching”

Validate Data Indexing

At this point, Splunk has created the new index and started indexing the raw EXIF data. It may take several minutes or longer for all of the data to be indexed by Splunk.

In this instance, the EXIF data took approximately 15 minutes to be indexed and ready for analysis.
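A simple way to watch the progress is to periodically re-run an event count over the index and wait for the count to stop growing:

index="pixel3exif" | stats count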

Perform Analysis

Now I can perform queries of the data with a focus on timelines. To execute queries in the Splunk search bar, I will specify the index to search as well as my search criteria. I’ll start with the following search:

index="pixel3exif" spotify Directory="G:/Android 10 Image with Documentation/Non-Cellebrite Extraction/Pixel 3/Pixel 3/data/data/com.spotify.music/cache/http-cache"

This search will yield results for accessed Spotify HTTP cache files. From the query results, selecting “Visualization”, then “Pivot” (for a pivot table), and then “All Fields”, the results can be manipulated in several different ways. Selecting a “Line Chart” displays the data in a way that provides a high-level overview of Spotify usage by day, perhaps to determine which days of the week or months the application is used the most:

Spotify http cache file access times (by quantity)
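The same daily overview can also be produced directly in the search language, skipping the pivot interface; a sketch:

index="pixel3exif" spotify Directory="G:/Android 10 Image with Documentation/Non-Cellebrite Extraction/Pixel 3/Pixel 3/data/data/com.spotify.music/cache/http-cache" | timechart span=1d count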

Performing another search, against EXIF data from pictures, can show both a total count and the dates associated with pictures that were directly photographed:

index="pixel3exif" SceneType="Directly photographed"

Again, utilizing the pivot table visualization method yields the following line chart:

6 JPEG and 2 PNG (data not displayed) pictures were directly photographed on the dates indicated on the line chart.
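The JPEG/PNG breakdown mentioned in the caption can be confirmed with a quick aggregation (FileType is one of the fields ExifTool emits):

index="pixel3exif" SceneType="Directly photographed" | stats count by FileType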

When viewing all data within the index, a brief timeline of data is displayed in Splunk:

Selecting the two large blocks towards the right homes in on the majority of the data:

Zooming in to the selection breaks down data by day:

The majority of the data from this image was accessed between January 29, 2020 and February 16, 2020.

My next search string will query within this specified date range:

index="pixel3exif" * | eval timeofday=case(date_hour>=5 AND date_hour<8,"Morning", date_hour>=8 AND date_hour<18,"Working Hours", date_hour>=18 AND date_hour<23,"Evening", date_hour>=23 OR date_hour<5,"Middle of Night", 1=1,"error") | timechart span=1d count by timeofday

This search evaluates the data against time ranges throughout the day, which I have defined as morning, working-hours, evening, and middle-of-the-night windows. From the output, I selected a line chart visualization, which displays the following:

Based on this initial chart, I can easily determine that most data on this phone was accessed during working hours, with the smallest portion accessed during the middle of the night. From here, I could pivot to one of the two spikes in the “middle-of-night” data:

There was an uncommon spike in file access in the middle of the night on February 14, 2020.

If I had supplemental leads or other data sources in an investigation to correlate, I could further investigate the cause of the “out-of-place” file access that occurred in the middle of the night on February 14, 2020.
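Drilling into that window could be as simple as constraining the time range inline and grouping by directory to see which application’s files were touched; a sketch with illustrative boundaries:

index="pixel3exif" earliest="02/14/2020:23:00:00" latest="02/15/2020:05:00:00" | stats count by Directory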

Conclusion and Future Research

This brief introduction to EXIF analysis with Splunk is a “drop in the bucket” compared to the analysis that could be performed. There is great value in analyzing data from a “moment-in-time” standpoint, but plenty of other approaches could be utilized.

For data containing substantial GPS information, visualizations could be created to demonstrate the dates, times of day, or frequency at which geographic locations were visited. There are also several Splunk applications that utilize Google Maps to automatically plot GPS locations alongside the correlating data.

Another possible approach would be analyzing data from multiple images or sources at once. If several mobile device images belonging to a group of people under investigation were queried together, a search could identify all points in time (with associated data) where the GPS data across multiple devices fell within a specified physical range of each other, demonstrating when the individuals came into physical contact. The same concept could be used to determine whether third-party applications were communicating with each other at specific points in time.

Finally, much of the data in mobile device images is contained within SQLite databases and XML files. Splunk is capable of ingesting this data, and if the legwork of teaching Splunk those data formats for ease of querying were performed, these sources would significantly increase the capabilities of the searches and timeline analysis possible with these methods.
