Immutable ledger-based security bigdata analytic system



The proposed system targets both batch and near real-time log analysis. Audit trails can be analyzed in batch mode, either automatically at regular intervals or manually on demand, and in near real time through the application. The serverless architecture adopted for the system improves its scalability and provides a cost-effective environment for audit log processing. It also removes the overhead of maintaining servers to run the application.
For batch processing, HiveQL running on the EMR cluster queries the data from storage and loads it into an external table, against which the batch job runs. The processed data is written to an output directory in cloud storage. The output is then queried and displayed on a dashboard for visualization.
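The batch step described above can be sketched as a small Python helper that generates the HiveQL script. This is only an illustration under stated assumptions: the table name, the column layout, the tab delimiter, the aggregation, and the S3 paths are all hypothetical, since the post does not list the actual schema or queries.

```python
from textwrap import dedent


def build_batch_hiveql(table: str, input_uri: str, output_uri: str) -> str:
    """Return a HiveQL script that exposes raw logs as an external table
    and writes an aggregate to an output directory.

    The schema and the GROUP BY query are assumptions for illustration.
    """
    return dedent(f"""\
        CREATE EXTERNAL TABLE IF NOT EXISTS {table} (
          event_time STRING,
          source     STRING,
          level      STRING,
          message    STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION '{input_uri}';

        INSERT OVERWRITE DIRECTORY '{output_uri}'
        STORED AS PARQUET
        SELECT source, level, COUNT(*) AS events
        FROM {table}
        GROUP BY source, level;
        """)


if __name__ == "__main__":
    # Hypothetical bucket and prefixes.
    print(build_batch_hiveql(
        "audit_logs",
        "s3://example-bucket/ledger-export/",
        "s3://example-bucket/output/",
    ))
```

Because the table is external, dropping it or terminating the cluster leaves the underlying files in storage untouched. Only the table definition is lost.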
To provide real-time log analysis and anomaly detection, a monitoring agent runs on a server and collects its logs. These logs are ingested into the Elasticsearch cluster, where an unsupervised machine learning model scans the data for anomalies.
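To give a feel for what unsupervised anomaly detection on log data means, here is a minimal sketch that flags unusual per-interval event counts using a modified z-score (median and median absolute deviation). This is a deliberately simple stand-in, not the model used in the Elasticsearch cluster. The function name, the threshold of 3.5, and the sample counts are assumptions for illustration.

```python
from statistics import median


def find_anomalies(counts, threshold=3.5):
    """Return the indices of counts whose modified z-score exceeds threshold.

    No labels are needed: the baseline is estimated from the data itself.
    """
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [i for i, c in enumerate(counts) if c != med]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, c in enumerate(counts)
            if 0.6745 * abs(c - med) / mad > threshold]


if __name__ == "__main__":
    # Hypothetical failed-logon counts per minute.
    per_minute = [12, 15, 11, 14, 13, 240, 12, 16]
    print(find_anomalies(per_minute))
```

The median-based score is used here rather than mean and standard deviation because a single large spike would otherwise inflate the baseline and hide itself.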


Application Workflow Diagram  


The following configuration was used for the EMR cluster provisioned while developing the proposed system.
Environment: Development
Applications: Hive 2.3.6, Hue 4.4.0
Release label: emr-5.28.0
Hadoop distribution: Amazon 2.8.5
Availability zone: us-east-1d
Master: 1 m5.xlarge 
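As a rough sketch, the configuration above could be expressed as the keyword arguments for boto3's `run_job_flow` call. The cluster name, the log URI, and the two IAM role names (the AWS default EMR roles) are assumptions, as is the use of `KeepJobFlowAliveWhenNoSteps=False` to make the cluster shut down once its steps finish. The post does not say how the cluster was actually provisioned.

```python
def emr_cluster_request(log_uri: str) -> dict:
    """Build run_job_flow keyword arguments mirroring the listed settings."""
    return {
        "Name": "audit-log-batch",            # hypothetical cluster name
        "LogUri": log_uri,                    # hypothetical S3 location
        "ReleaseLabel": "emr-5.28.0",
        "Applications": [{"Name": "Hive"}, {"Name": "Hue"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                }
            ],
            "Placement": {"AvailabilityZone": "us-east-1d"},
            # Assumption: terminate the cluster when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


if __name__ == "__main__":
    request = emr_cluster_request("s3://example-bucket/emr-logs/")
    print(request["ReleaseLabel"])
    # With AWS credentials configured, the request could be submitted as:
    #   import boto3
    #   boto3.client("emr", region_name="us-east-1").run_job_flow(**request)
```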

 

Auditors can enumerate the audit logs and upload them to the cloud using the application. The exported ledger output is stored in S3. Once processing is complete, the results are saved to S3 as a compressed Parquet file and the EMR cluster is terminated automatically.

Because terminating the cluster would otherwise lose the table metadata, the system transfers the metadata to the AWS Glue Data Catalog. AWS Glue is used to prepare and load the data for analytics. System event logs, application logs, and security logs from Windows and Linux systems were used for analysis while developing the proposed system.
To demonstrate real-time data analysis, an EC2 instance was used as the source of the audit trails. IIS server logs, event logs, and VPC flow logs were streamed from the server to CloudWatch.
Each log group is then streamed to the Elasticsearch cluster for near real-time data analysis and anomaly detection. The audit trails are stored in an immutable, verifiable ledger to protect their integrity and ensure accountability. Each record in the ledger is treated as a document. Each document can be cryptographically verified, and every alteration made to it can be identified.
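The idea of a cryptographically verifiable ledger can be illustrated with a simple hash chain, in which every entry's digest also covers the digest of the entry before it. This is a sketch of the general principle only. The class and method names are hypothetical, and the ledger service used by the system may rely on a different structure (for example a Merkle tree) internally.

```python
import hashlib
import json


def _digest(prev_hash: str, document: dict) -> str:
    """SHA-256 over the previous digest plus a canonical JSON encoding."""
    payload = json.dumps(document, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


class HashChainLedger:
    """Append-only list of (document, digest) pairs forming a hash chain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    def append(self, document: dict) -> str:
        prev = self.entries[-1][1] if self.entries else self.GENESIS
        digest = _digest(prev, document)
        self.entries.append((dict(document), digest))
        return digest

    def first_invalid(self):
        """Return the index of the first entry that fails verification, or None."""
        prev = self.GENESIS
        for index, (document, digest) in enumerate(self.entries):
            if _digest(prev, document) != digest:
                return index
            prev = digest
        return None


if __name__ == "__main__":
    ledger = HashChainLedger()
    ledger.append({"event": "logon", "user": "alice"})
    ledger.append({"event": "file_read", "user": "alice"})
    print(ledger.first_invalid())          # chain is intact
    ledger.entries[0][0]["user"] = "bob"   # simulate tampering
    print(ledger.first_invalid())          # tampered entry is detected
```

Because each digest depends on all earlier entries, altering any stored document invalidates that entry's digest unless every later digest is recomputed as well, which is what makes silent edits detectable.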
The application can export the ledgers, and the exported journals are used for analytics, auditing, and verification. These exports can also serve as backups and can be transferred to other systems when needed.

High Level System Architecture Diagrams 

Storing and retrieving ledger data  

 
Batch processing log data  

 

Near Realtime log analysis  

 

The following configuration was used for the Elasticsearch cluster while developing the application.
Number of nodes: 1
Number of data nodes: 1
Active primary shards: 32
Active shards: 32
Data nodes storage type: EBS
EBS volume type: General Purpose (SSD)
Data instance type: t2.small.elasticsearch
The EC2 instance used as the source of the audit trails for the real-time demonstration had the following configuration.
Instance type: t2.micro
AMI ID: Windows_Server-2019-English-Full-Base-2019.10.09 (ami-0d4df21ffeb914d61)
IIS web server version: 10.0
Monitoring agent: Unified CloudWatch agent

 

