Project SynShade: Threat Classifier

Introduction

SynShade is an NLP-powered framework designed to classify malicious recruitment and advertising chatter from the dark web, specifically from Telegram channels associated with insider threat campaigns. The current version (v1) handles binary classification (recruiting vs. advertising) through the sigmoid function. The next iteration will use the normalized exponential function (softmax) to add multi-class industry prediction, opening new dimensions for cyber threat intelligence (CTI). Beyond Telegram channels, you can also use SynShade on any suitable dataset you have for training or analysis (X, Facebook, Discord, Reddit), provided you have already taken care of the scraping and data cleaning.
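
To make the difference between the two output functions concrete, here is a minimal standard-library sketch. The logit values and the order of the industry list are made up for illustration; they are not outputs of the trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits: list[float]) -> list[float]:
    """Map a list of logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# v1-style output: one logit, one probability for the "recruiting" class
p_recruiting = sigmoid(0.0)  # a logit of 0 means "undecided"

# v2-style output: one logit per industry class (hypothetical logits)
industries = ["Technology", "Retail", "Finance", "Telecom"]
probs = softmax([2.0, 0.5, 0.5, -1.0])
predicted = industries[probs.index(max(probs))]

print(p_recruiting, probs, predicted)
```

Sigmoid scores one class on its own, while softmax makes the classes compete for a shared probability mass, which is why it fits the planned single-label industry prediction.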

The trained model is available on Hugging Face: synshade-insider-threat-detector

Before we dig into the model training though, I think it’s important to highlight potential extensions of SynShade for CTI (Skip to “Problem: Detecting Insider Threats at Scale” if you don’t care about this).

Potential Extensions of SynShade for Cyber Threat Intelligence (CTI)

The underlying architecture of SynShade—based on bert-base-uncased and trained on domain-specific Telegram data—offers a flexible foundation for several advanced CTI use cases beyond insider recruitment detection. Here are some predicted applications:

1. Threat Actor Intent Classification
Extend the binary classifier into a multi-class system to identify the intent behind messages. Potential classes include recruiting, scamming, attack planning, tool sharing, and data selling. This enables better prioritization based on threat severity and actor intent.

2. Jargon and Codeword Decoding
Fine-tune the model to recognize evolving dark web slang and obfuscation techniques. This supports detection of concealed terminology used for malware, insiders, or exploits, making it valuable for real-time intelligence collection.

3. Threat Actor Profiling
Analyze linguistic patterns across messages to group users by tone, intent, or thematic focus. This helps in building psychological and operational profiles of threat actors, supporting attribution efforts.

4. Named Entity Recognition (NER)
Adapt the model to extract entities such as company names, job titles, tools, or targets from messages. This transforms unstructured chatter into structured intelligence that can feed dashboards or intelligence platforms.

5. Source Credibility Scoring
Train the model to assess the reliability of messages or sources based on content patterns, metadata, or engagement history. This helps analysts separate signal from noise and focus on high-fidelity intelligence.

6. Malware and Tool Detection
Classify messages that mention or distribute malware, exploit kits, information stealers, or C2 frameworks. This supports real-time tracking of tool distribution and early detection of new malware families.

7. Dark Web Market Intelligence
Extract information about illicit product categories, pricing, vendor behavior, and market trends. This application supports fraud intelligence, supply chain risk analysis, and market disruption efforts.

8. Attack Timeline Prediction
Classify message content into phases of an attack lifecycle: planning, execution, or post-incident. This helps in forecasting active campaigns and enabling preemptive defense measures.

9. TTP (Tactics, Techniques, Procedures) Detection
Label messages according to known techniques in frameworks such as MITRE ATT&CK. This enables automatic mapping of dark web discussions to specific adversary behaviors and campaign indicators.

Problem: Detecting Insider Threats at Scale

Dark web chatter related to insider recruitment is often buried under layers of slang, jargon, and obfuscation. Manual monitoring is labor-intensive and slow to scale. SynShade addresses this with a robust multi-task BERT architecture that automates classification while maintaining high accuracy, even in noisy, unstructured environments.

Methodology

Data Collection and Preparation

Data was collected from curated Telegram channels focused on insider recruiting or advertising. Each entry in the dataset includes:

Message (Msg)
Action Label (Recruiting or Advertising)
Industry Label (Technology, Retail, Finance, Telecom — planned for v2)

Model Architecture

Base Model: bert-base-uncased from Hugging Face Transformers
Tokenization: 128-token maximum length with padding and truncation
Multi-Task Output Heads:
- Binary classification for Action (Recruiting vs. Advertising)
- Multi-class classification for Industry (Planned for v2)
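
The "128-token maximum length with padding and truncation" step can be sketched in plain Python. This is a simplified illustration of what the tokenizer does for a single sentence, not the Hugging Face implementation. The function name is hypothetical, the example token IDs are arbitrary, and the special-token IDs are the ones used by the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_or_truncate(token_ids, max_length=128):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad            # fill the rest with padding
    attention_mask += [0] * n_pad            # 0 = padding, ignored by attention
    return input_ids, attention_mask

# A short message gets padded; a long one gets truncated (max_length=8 for readability)
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1200)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what lets the model declare fixed-shape inputs.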

Training and Optimization

Environment: Google Colab Free (12 GB RAM, NVIDIA T4 GPU, 100 GB disk)
Framework: TensorFlow, integrated with Hugging Face
Optimizer: AdamW (learning_rate = 2e-5)
Batch Size: 32 (adjusted based on GPU memory)
Epochs: 3 completed (training is configured for up to 10 epochs with an early-stopping patience of 2: if val_loss does not improve for 2 consecutive epochs, training stops early)
Regularization: Dropout (0.1), EarlyStopping based on validation loss
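
To see how a 10-epoch budget can end after 3 epochs, here is a small sketch of the early-stopping rule. It mimics the counting logic of a patience-based callback in plain Python. The function name is hypothetical and the validation losses are invented for illustration, not taken from the actual training run.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops.

    If the patience limit is never reached, return the number of epochs run.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical val_loss values: the best epoch is the first one
losses = [0.40, 0.45, 0.50, 0.55, 0.60]
stop_with_patience_2 = early_stop_epoch(losses, patience=2)
stop_with_patience_3 = early_stop_epoch(losses, patience=3)

print(stop_with_patience_2, stop_with_patience_3)
```

Under this rule a patience of 2 can stop training at epoch 3 at the earliest, which matches the 3 epochs reported above. Note that the callback code later in this document passes patience=3, for which the earliest possible stop is epoch 4, so adjust one or the other to match your run.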

Computational Infrastructure

SynShade was trained on Google Colab Free using a T4 GPU. Resource usage (GPU memory, RAM) was actively monitored to dynamically adjust batch sizes. Model checkpoints were saved to Google Drive after each epoch to ensure reproducibility and allow recovery from session disconnects.

Problem Solved

This project enhances cyber threat intelligence capabilities by automating the classification of insider recruitment chatter from the dark web. Through BERT-based multi-task learning, SynShade supports scalable, real-time detection pipelines for cybersecurity teams. The approach bridges advanced NLP with cloud-based infrastructure to deliver a reliable and extensible monitoring tool.

Key Takeaways

NLP-driven detection of dark web insider chatter
Fine-tuning BERT on domain-specific Telegram messages enables precise threat classification.
Two-stage classification pipeline
Stage 1: Binary classification (recruiting vs. advertising). Stage 2: Industry targeting prediction (coming in v2).
Tailored for operational use
Preprocessing, expert labeling, and cloud-based training ensure SynShade is ready for real-world cybersecurity deployment.

Code: Mount, Install, and Import

Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

Install required libraries

!pip install -U "tensorflow-text==2.13.*"
!pip install "tf-models-official==2.13.*"
!pip install transformers

Import Modules

import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras import layers
from transformers import BertTokenizer

print(f"TensorFlow version: {tf.__version__}")
print(f"TensorFlow Hub version: {hub.__version__}")

Load BERT tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Preprocess Data

# Converts the Action column to binary format where Recruiting is 1, and other actions are 0.
def preprocess_data(df):
    df['Action'] = (df['Action'] == 'Recruiting').astype(int)
    return df

# Encodes the text data into token IDs, attention masks, and token type IDs using the BERT tokenizer.
# This step converts raw text into numerical format suitable for input to a BERT model.
def encode(texts, tokenizer, max_length=128):
    encodings = tokenizer(
        texts.tolist(),
        truncation=True,
        padding='max_length',  # pad every example to max_length so shapes match the model's fixed (128,) inputs
        max_length=max_length,
        return_tensors='tf'
    )
    return encodings['input_ids'], encodings['attention_mask'], encodings['token_type_ids']

# Converts the preprocessed DataFrame into a TensorFlow Dataset.
# This dataset includes tokenized inputs and labels, ready for model training.

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('Action')
    texts = dataframe.pop('Message')
    input_ids, attention_masks, token_type_ids = encode(texts, tokenizer)

    ds = tf.data.Dataset.from_tensor_slices((
        {
            "input_word_ids": input_ids,
            "input_mask": attention_masks,
            "input_type_ids": token_type_ids  # segment IDs; the key must match the model's 'input_type_ids' input
        },
        labels.values
    ))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds


# Load and preprocess data. Adjust the path to your setup: "My Drive" is the root folder of your Google Drive.
data = pd.read_csv('/content/drive/My Drive/synshade/data.csv')
data = preprocess_data(data)  # Convert 'Action' to binary

train_data, temp_data = train_test_split(data, test_size=0.30, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.50, random_state=42)

# Save datasets to drive
train_data.to_csv('/content/drive/My Drive/synshade/train_data.csv', index=False)
val_data.to_csv('/content/drive/My Drive/synshade/val_data.csv', index=False)
test_data.to_csv('/content/drive/My Drive/synshade/test_data.csv', index=False)

print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")
print(f"Test set size: {len(test_data)}")
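
The two train_test_split calls above produce roughly a 70/15/15 split. Here is a quick sketch of the arithmetic so you can predict the three printed sizes for your own dataset. The helper name is hypothetical, and the rounding (test side rounded up, remainder goes to the other side) is my understanding of how scikit-learn handles fractional test sizes, so treat it as an assumption and compare with the printed values.

```python
import math

def split_sizes(n_rows, first_test=0.30, second_test=0.50):
    """Estimate (train, val, test) row counts for the two-step split."""
    n_temp = math.ceil(first_test * n_rows)   # held out by the first split
    n_train = n_rows - n_temp
    n_test = math.ceil(second_test * n_temp)  # "test" side of the second split
    n_val = n_temp - n_test                   # remainder becomes validation
    return n_train, n_val, n_test

sizes_1000 = split_sizes(1000)
sizes_101 = split_sizes(101)

print(sizes_1000, sizes_101)
```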

Load Datasets and Prep Training

# Load datasets from disk
train_data = pd.read_csv('/content/drive/My Drive/synshade/train_data.csv')
val_data = pd.read_csv('/content/drive/My Drive/synshade/val_data.csv')
test_data = pd.read_csv('/content/drive/My Drive/synshade/test_data.csv')

batch_size = 32
AUTOTUNE = tf.data.AUTOTUNE

train_ds = df_to_dataset(train_data, batch_size=batch_size)
val_ds = df_to_dataset(val_data, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test_data, shuffle=False, batch_size=batch_size)

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Create and compile model

def create_model():
    input_word_ids = layers.Input(shape=(128,), dtype=tf.int32, name='input_word_ids') # token IDs generated by the BERT tokenizer, representing the tokenized words in each input sentence.
    input_mask = layers.Input(shape=(128,), dtype=tf.int32, name='input_mask') # attention masks that help BERT focus on actual tokens while ignoring padding.
    input_type_ids = layers.Input(shape=(128,), dtype=tf.int32, name='input_type_ids') # segment IDs that distinguish between different sentences within a single input.

    # The TF Hub BERT encoder takes its three inputs as a dictionary
    bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/2", trainable=True)
    bert_inputs = {
        'input_word_ids': input_word_ids,
        'input_mask': input_mask,
        'input_type_ids': input_type_ids
    }

    # bert_layer returns a dictionary of output tensors
    bert_outputs = bert_layer(bert_inputs)
    pooled_output = bert_outputs['pooled_output']  # Extract pooled_output
    output = layers.Dense(1, activation='sigmoid')(pooled_output) # sigmoid activation is a mathematical function commonly used in neural networks for binary classification problems

    model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)
    return model

model = create_model()

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
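
If you are wondering what the binary_crossentropy loss actually measures, here is a plain-Python sketch of the formula, averaged over a batch. The labels and probabilities are invented for illustration, and the clipping constant eps is an assumption of this sketch.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]  # 1 = Recruiting, 0 = Advertising

loss_undecided = binary_cross_entropy(labels, [0.5, 0.5])  # model is unsure
loss_confident = binary_cross_entropy(labels, [0.9, 0.1])  # confident and right
loss_wrong = binary_cross_entropy(labels, [0.1, 0.9])      # confident and wrong

print(loss_undecided, loss_confident, loss_wrong)
```

The loss is small when the sigmoid output is close to the true label and grows quickly when the model is confidently wrong, which is the signal the optimizer minimizes during training.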

Define callbacks and train model

checkpoint_path = '/content/drive/My Drive/synshade/checkpoints/model_checkpoint.h5'

# Save the model only when validation loss improves
checkpoint_cb = ModelCheckpoint(
    checkpoint_path,
    save_best_only=True,
    monitor='val_loss',
    mode='min',
    verbose=1
)

# Stop once val_loss has not improved for `patience` epochs, then restore the best weights
early_stopping_cb = EarlyStopping(
    monitor='val_loss',
    patience=3,
    verbose=1,
    restore_best_weights=True
)

# Resume from an earlier checkpoint if one exists (e.g. after a Colab session disconnect)
if os.path.exists(checkpoint_path):
    model.load_weights(checkpoint_path)
    print("Loaded model from checkpoint.")

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[checkpoint_cb, early_stopping_cb]
)

Save final model

model.save('/content/drive/My Drive/synshade/final_model.h5')

# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_ds)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")

Outcome

Once training finishes, the test loss and accuracy are printed. The model itself is saved in Hierarchical Data Format version 5 (.h5 extension).

Before deploying your model against real-world data, I highly recommend building a confusion matrix from the test set. Comparing predicted labels with actual labels shows which kinds of mistakes the model makes (false positives vs. false negatives), which a single accuracy number hides.
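
Here is a minimal sketch of such a confusion matrix in plain Python. It assumes you have already collected the true Action labels and the model's sigmoid outputs for the test set. The function name, the 0.5 threshold, and the sample values below are illustrative assumptions, not results from SynShade.

```python
def confusion_matrix_binary(y_true, y_prob, threshold=0.5):
    """Count TP/FP/TN/FN for binary labels, treating 1 (Recruiting) as positive."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            counts["tp"] += 1
        elif pred == 1 and y == 0:
            counts["fp"] += 1
        elif pred == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

# Made-up labels and probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7]

cm = confusion_matrix_binary(y_true, y_prob)
precision = cm["tp"] / (cm["tp"] + cm["fp"])  # how many flagged messages were truly recruiting
recall = cm["tp"] / (cm["tp"] + cm["fn"])     # how many recruiting messages were caught

print(cm, precision, recall)
```

Since scikit-learn is already imported in this notebook, sklearn.metrics.confusion_matrix is a ready-made alternative that returns the same counts laid out as [[tn, fp], [fn, tp]].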