Introduction
SynShade is an NLP-powered framework designed to classify malicious recruitment and advertising chatter from the dark web, specifically from Telegram channels associated with insider threat campaigns. The current version (v1) handles binary classification (recruiting vs. advertising) through the sigmoid function. The next iteration will use the normalized exponential function (softmax) to add multi-class industry prediction, opening new dimensions for cyber threat intelligence (CTI). Outside of Telegram channels, you can also use SynShade on any valid dataset you might possess for training or analysis (X, Facebook, Discord, Reddit), assuming you have taken care of the scraping and data cleaning.
The trained model is available on Hugging Face: synshade-insider-threat-detector
Before we dig into model training, I think it’s important to highlight potential extensions of SynShade for CTI (skip to “Problem: Detecting Insider Threats at Scale” if you don’t care about this).
Potential Extensions of SynShade for Cyber Threat Intelligence (CTI)
The underlying architecture of SynShade, based on bert-base-uncased and trained on domain-specific Telegram data, offers a flexible foundation for several advanced CTI use cases beyond insider recruitment detection. Here are some potential applications:
1. Threat Actor Intent Classification
Extend the binary classifier into a multi-class system to identify the intent behind messages. Potential classes include recruiting, scamming, attack planning, tool sharing, and data selling. This enables better prioritization based on threat severity and actor intent.
2. Jargon and Codeword Decoding
Fine-tune the model to recognize evolving dark web slang and obfuscation techniques. This supports detection of concealed terminology used for malware, insiders, or exploits, making it valuable for real-time intelligence collection.
3. Threat Actor Profiling
Analyze linguistic patterns across messages to group users by tone, intent, or thematic focus. This helps in building psychological and operational profiles of threat actors, supporting attribution efforts.
4. Named Entity Recognition (NER)
Adapt the model to extract entities such as company names, job titles, tools, or targets from messages. This transforms unstructured chatter into structured intelligence that can feed dashboards or intelligence platforms.
5. Source Credibility Scoring
Train the model to assess the reliability of messages or sources based on content patterns, metadata, or engagement history. This helps analysts separate signal from noise and focus on high-fidelity intelligence.
6. Malware and Tool Detection
Classify messages that mention or distribute malware, exploit kits, information stealers, or C2 frameworks. This supports real-time tracking of tool distribution and early detection of new malware families.
7. Dark Web Market Intelligence
Extract information about illicit product categories, pricing, vendor behavior, and market trends. This application supports fraud intelligence, supply chain risk analysis, and market disruption efforts.
8. Attack Timeline Prediction
Classify message content into phases of an attack lifecycle: planning, execution, or post-incident. This helps in forecasting active campaigns and enabling preemptive defense measures.
9. TTP (Tactics, Techniques, Procedures) Detection
Label messages according to known techniques in frameworks such as MITRE ATT&CK. This enables automatic mapping of dark web discussions to specific adversary behaviors and campaign indicators.
Problem: Detecting Insider Threats at Scale
Dark web chatter related to insider recruitment is often buried under layers of slang, jargon, and obfuscation. Manual monitoring is labor-intensive and slow to scale. SynShade addresses this with a robust multi-task BERT architecture that automates classification while maintaining high accuracy, even in noisy, unstructured environments.
Methodology
Data Collection and Preparation
Data was collected from curated Telegram channels focused on insider recruiting or advertising. Each entry in the dataset includes:
- Message (Msg)
- Action Label (Recruiting or Advertising)
- Industry Label (Technology, Retail, Finance, Telecom; planned for v2)
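As a rough illustration of that record layout, the sketch below encodes the two labels the way a training script might. The sample messages are hypothetical placeholders rather than real data, and the Industry column name and the ordering of the industry list are my assumptions (the training code later in this post only uses Message and Action).

```python
# Hypothetical rows; real messages come from the curated Telegram channels.
rows = [
    {"Message": "<recruiting-style message>", "Action": "Recruiting", "Industry": "Finance"},
    {"Message": "<advertising-style message>", "Action": "Advertising", "Industry": "Telecom"},
]

# Industry classes in a fixed (assumed) order so each one gets a stable index.
INDUSTRIES = ["Technology", "Retail", "Finance", "Telecom"]

def encode_labels(row):
    """Return (action_id, industry_id) for one record."""
    action_id = 1 if row["Action"] == "Recruiting" else 0  # binary target (v1)
    industry_id = INDUSTRIES.index(row["Industry"])         # multi-class target (planned for v2)
    return action_id, industry_id

encoded = [encode_labels(r) for r in rows]
print(encoded)  # [(1, 2), (0, 3)]
```

Keeping the industry list in one fixed place matters because the integer index is what a multi-class head would be trained to predict.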
Model Architecture
- Base Model: bert-base-uncased from Hugging Face Transformers
- Tokenization: 128-token maximum length with padding and truncation
- Multi-Task Output Heads:
  - Binary classification for Action (Recruiting vs. Advertising)
  - Multi-class classification for Industry (Planned for v2)
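To make the difference between the two heads concrete, here is a minimal sketch of the two activation functions in plain Python. The logits are hypothetical numbers chosen for illustration; in the real model they would come from a dense layer on top of BERT's pooled output.

```python
import math

def sigmoid(z):
    """Map one logit to a probability in (0, 1); fits the binary Action head."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map a list of logits to probabilities that sum to 1; fits a multi-class Industry head."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p_recruiting = sigmoid(0.0)
industry_probs = softmax([2.0, 1.0, 0.5, 0.1])  # hypothetical logits for 4 industries

print(p_recruiting)                               # 0.5
print(industry_probs.index(max(industry_probs)))  # 0
```

The sigmoid head answers one yes/no question per message, while the softmax head spreads probability across mutually exclusive classes and the highest one is taken as the prediction.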
Training and Optimization
- Environment: Google Colab Free (12 GB RAM, NVIDIA T4 GPU, 100 GB disk)
- Framework: TensorFlow, integrated with Hugging Face
- Optimizer: AdamW (learning_rate = 2e-5)
- Batch Size: 32 (adjusted based on GPU memory)
- Epochs: 3 (the model is set to run for up to 10 epochs with a patience of 2, meaning that if val_loss doesn’t improve for 2 consecutive epochs, training stops early)
- Regularization: Dropout (0.1), EarlyStopping based on validation loss
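The patience rule is easy to misread, so here is a simplified simulation of it in plain Python. It is only a sketch of the idea, not the Keras callback itself, and the validation losses are hypothetical.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (stopped_epoch, best_epoch), both 1-indexed.

    Simplified model of patience-based early stopping: training stops once
    the validation loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # ran to the end without triggering

# Hypothetical validation losses for a run capped at 10 epochs.
losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44, 0.45, 0.46, 0.47]
stopped, best = simulate_early_stopping(losses, patience=2)
print(stopped, best)  # 5 3
```

In this made-up run the loss bottoms out at epoch 3, two epochs pass without improvement, and training halts at epoch 5, which is how a run capped at 10 epochs can end with an effective epoch count of 3.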
Computational Infrastructure
SynShade was trained on Google Colab Free using a T4 GPU. Resource usage (GPU memory, RAM) was actively monitored to dynamically adjust batch sizes. Model checkpoints were saved to Google Drive after each epoch to ensure reproducibility and allow recovery from session disconnects.
Problem Solved
This project enhances cyber threat intelligence capabilities by automating the classification of insider recruitment chatter from the dark web. Through BERT-based multi-task learning, SynShade supports scalable, real-time detection pipelines for cybersecurity teams. The approach bridges advanced NLP with cloud-based infrastructure to deliver a reliable and extensible monitoring tool.
Key Takeaways
- NLP-driven detection of dark web insider chatter
  Fine-tuning BERT on domain-specific Telegram messages enables precise threat classification.
- Two-stage classification pipeline
  Stage 1: Binary classification (recruiting vs. advertising). Stage 2: Industry targeting prediction (coming in v2).
- Tailored for operational use
  Preprocessing, expert labeling, and cloud-based training ensure SynShade is ready for real-world cybersecurity deployment.
Code: Mount, Install, and Import
- Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
- Install required libraries
!pip install -U "tensorflow-text==2.13.*"
!pip install "tf-models-official==2.13.*"
!pip install transformers
- Import Modules
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras import layers
from transformers import BertTokenizer
print(f"TensorFlow version: {tf.__version__}")
print(f"TensorFlow Hub version: {hub.__version__}")
- Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
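Before using the tokenizer, it helps to see what fixed-length padding and truncation produce. The toy function below is my own illustration, not the tokenizer's implementation: the token IDs are hypothetical, and unlike the real tokenizer it does not preserve the special [CLS]/[SEP] tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = token_ids[:max_length]      # truncation: drop anything past max_length
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Hypothetical token IDs, with max_length shortened to 6 for readability.
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every message ends up with the same length, and the attention mask tells BERT which positions to ignore.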
- Preprocess Data
# Converts the Action column to binary format where Recruiting is 1, and other actions are 0.
def preprocess_data(df):
    df['Action'] = (df['Action'] == 'Recruiting').astype(int)
    return df
# Encodes the text data into token IDs, attention masks, and token type IDs using the BERT tokenizer.
# This step converts raw text into numerical format suitable for input to a BERT model.
def encode(texts, tokenizer, max_length=128):
    encodings = tokenizer(
        texts.tolist(),
        truncation=True,
        padding='max_length',  # pad every sequence to max_length so shapes match the model's (128,) inputs
        max_length=max_length,
        return_tensors='tf'
    )
    return encodings['input_ids'], encodings['attention_mask'], encodings['token_type_ids']
# Converts the preprocessed DataFrame into a TensorFlow Dataset.
# This dataset includes tokenized inputs and labels, ready for model training.
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('Action')
    texts = dataframe.pop('Message')
    input_ids, attention_masks, token_type_ids = encode(texts, tokenizer)
    ds = tf.data.Dataset.from_tensor_slices((
        {
            "input_word_ids": input_ids,
            "input_mask": attention_masks,
            "input_type_ids": token_type_ids  # Changed from 'segment_ids' to 'input_type_ids'
        },
        labels.values
    ))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
# Load and preprocess data (adjust data path accordingly. Here, "My Drive" is your Drive main page)
data = pd.read_csv('/content/drive/My Drive/synshade/data.csv')
data = preprocess_data(data) # Convert 'Action' to binary
train_data, temp_data = train_test_split(data, test_size=0.30, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.50, random_state=42)
# Save datasets to drive
train_data.to_csv('/content/drive/My Drive/synshade/train_data.csv', index=False)
val_data.to_csv('/content/drive/My Drive/synshade/val_data.csv', index=False)
test_data.to_csv('/content/drive/My Drive/synshade/test_data.csv', index=False)
print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")
print(f"Test set size: {len(test_data)}")
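The two-step split above works out to roughly 70% training, 15% validation, and 15% test. The sketch below reproduces that arithmetic in plain Python, assuming the held-out share is rounded up at each step (which is how I understand scikit-learn's behavior for a float test_size); the helper name and the dataset size of 1,000 rows are hypothetical.

```python
import math

def split_sizes(n, holdout=0.30, test_share=0.50):
    """Return (train, val, test) sizes for a two-step 70/15/15 style split."""
    n_temp = math.ceil(holdout * n)          # first split: 30% held out
    n_train = n - n_temp
    n_test = math.ceil(test_share * n_temp)  # second split: half of the held-out rows
    n_val = n_temp - n_test
    return n_train, n_val, n_test

print(split_sizes(1000))  # (700, 150, 150)
```

Comparing these expected sizes with the three printed set sizes is a quick sanity check that the split ran as intended.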
- Load Datasets and Prep Training
# Load datasets from disk
train_data = pd.read_csv('/content/drive/My Drive/synshade/train_data.csv')
val_data = pd.read_csv('/content/drive/My Drive/synshade/val_data.csv')
test_data = pd.read_csv('/content/drive/My Drive/synshade/test_data.csv')
batch_size = 32
AUTOTUNE = tf.data.AUTOTUNE
train_ds = df_to_dataset(train_data, batch_size=batch_size)
val_ds = df_to_dataset(val_data, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test_data, shuffle=False, batch_size=batch_size)
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
- Create and compile model
def create_model():
    input_word_ids = layers.Input(shape=(128,), dtype=tf.int32, name='input_word_ids')  # token IDs generated by the BERT tokenizer, representing the tokenized words in each input sentence.
    input_mask = layers.Input(shape=(128,), dtype=tf.int32, name='input_mask')  # attention masks that help BERT focus on actual tokens while ignoring padding.
    input_type_ids = layers.Input(shape=(128,), dtype=tf.int32, name='input_type_ids')  # segment IDs that distinguish between different sentences within a single input.
    # Correct the input format to a dictionary
    bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/2", trainable=True)
    bert_inputs = {
        'input_word_ids': input_word_ids,
        'input_mask': input_mask,
        'input_type_ids': input_type_ids
    }
    # bert_layer returns a dictionary of output tensors
    bert_outputs = bert_layer(bert_inputs)
    pooled_output = bert_outputs['pooled_output']  # Extract pooled_output
    output = layers.Dense(1, activation='sigmoid')(pooled_output)  # sigmoid activation is a mathematical function commonly used in neural networks for binary classification problems
    model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)
    return model
model = create_model()
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
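For intuition about the binary_crossentropy loss used in the compile step, here is a small plain-Python version of the formula. It is a sketch of the math rather than the Keras implementation, and the labels and probabilities are hypothetical.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels (1 = Recruiting) and sigmoid outputs.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```

The loss is small when the sigmoid output is close to the true label and grows sharply when the model is confidently wrong, which is what pushes the weights in the right direction during training.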
- Define callbacks and train model
checkpoint_cb = ModelCheckpoint(
    '/content/drive/My Drive/synshade/checkpoints/model_checkpoint.h5',
    save_best_only=True,
    monitor='val_loss',
    mode='min',
    verbose=1
)
early_stopping_cb = EarlyStopping(
    monitor='val_loss',
    patience=3,
    verbose=1,
    restore_best_weights=True
)
checkpoint_path = '/content/drive/My Drive/synshade/checkpoints/model_checkpoint.h5'
if os.path.exists(checkpoint_path):
    model.load_weights(checkpoint_path)
    print("Loaded model from checkpoint.")
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[checkpoint_cb, early_stopping_cb]
)
- Save final model
model.save('/content/drive/My Drive/synshade/final_model.h5')
# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_ds)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")
Outcome
After training finishes, the test loss and accuracy are printed. The model itself is saved in Hierarchical Data Format version 5 (.h5 extension).
Before deploying your model against real-world data, I highly recommend computing a confusion matrix. It compares the predicted labels with the actual labels and shows which kinds of errors the model makes, not just how often it is right.
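As a starting point, here is a minimal confusion matrix sketch in plain Python. The labels and probabilities are hypothetical, and the 0.5 threshold is an assumption you may want to tune.

```python
def confusion_matrix_binary(y_true, y_prob, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels and predicted probabilities."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        matrix[y][pred] += 1  # rows: actual label, columns: predicted label
    return matrix

# Hypothetical labels (1 = Recruiting, 0 = Advertising) and model outputs.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.92, 0.71, 0.35, 0.10, 0.64, 0.05]
cm = confusion_matrix_binary(y_true, y_prob)
(tn, fp), (fn, tp) = cm
print(cm)              # [[2, 1], [1, 2]]
print(tp / (tp + fp))  # precision for the Recruiting class
print(tp / (tp + fn))  # recall for the Recruiting class
```

With the model from this post, y_prob would come from model.predict(test_ds) and y_true from the Action column of the test set. scikit-learn's confusion_matrix function uses the same rows-are-actual, columns-are-predicted layout if you prefer a library call.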