Unlocking Blockchain Identity: How to Predict Account Roles Through Transaction Analysis


Blockchain technology has revolutionized industries by offering a secure, decentralized, and transparent framework for recording transactions. Within blockchain networks, countless accounts interact daily—sending, receiving, and validating transactions. While many of these interactions represent legitimate services or user activities, others may indicate malicious behavior such as fraud or phishing. Accurately identifying the roles behind blockchain addresses is essential for enhancing security, improving analytics, and mitigating risks in decentralized ecosystems.

This article explores the challenge of predicting blockchain account identities using transaction data—specifically focusing on ETH and ERC20 transactions—and outlines strategies for building effective machine learning models to classify account types such as DEX, CEX, lending, phishing, and more.


Understanding the Challenge: Predicting Blockchain Account Roles

The core task involves analyzing raw transaction data to determine the functional role of unknown blockchain addresses. Given labeled training data, participants must develop models that can generalize patterns and accurately predict labels for unseen accounts in a test dataset.

This problem directly contributes to blockchain security, fraud detection, and on-chain analytics, making it highly relevant for developers, researchers, and organizations operating in Web3.



Data Overview: ETH and ERC20 Transaction Records

The dataset is structured under competition_dataset/training and competition_dataset/testing, with each account represented by two CSV files: one holding its ETH transactions and one holding its ERC20 transactions.

Either file may be empty, indicating no activity of that type, but every test address has at least one non-empty transaction file.
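
Because empty files are expected, a loader should treat them as "no rows" rather than as an error. The following is a minimal sketch using only the Python standard library; the function name load_transactions is hypothetical, and the column names come from whatever header row each CSV actually carries.

```python
import csv
from pathlib import Path


def load_transactions(path):
    """Return the rows of one transaction CSV as a list of dicts.

    A missing file, a zero-byte file, or a file holding only a header
    row all yield an empty list.
    """
    path = Path(path)
    if not path.exists() or path.stat().st_size == 0:
        return []
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

Returning an empty list keeps downstream feature code free of special cases: counts and sums over an empty list are simply zero.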

Key Features in ETH Transactions

Key Features in ERC20 Transactions

These fields form the foundation for feature engineering, enabling insights into behavioral patterns across different account types.


Label Categories: What Are We Classifying?

The training set includes labeled accounts under seven primary categories:

  1. dex – Decentralized exchanges (e.g., Uniswap), characterized by frequent swaps and liquidity provision.
  2. cex – Centralized exchanges (e.g., Binance), often showing high-volume inflows/outflows and centralized deposit addresses.
  3. lending – Protocols like Aave or Compound; marked by collateral deposits, borrow events, and interest accruals.
  4. gambling – On-chain betting platforms with repetitive small-value transactions and random payout patterns.
  5. phishing – Malicious accounts tricking users into sending funds; often show sudden spikes in incoming transfers from diverse sources.
  6. wallet – Personal wallets used for storage and occasional transfers; typically low-frequency activity.
  7. payments – Merchant or service accounts receiving regular micropayments or subscription fees.

Understanding these behaviors allows us to extract meaningful features that distinguish one class from another.


Feature Engineering: From Raw Data to Predictive Signals

Effective classification starts with intelligent feature extraction. Here are some high-impact features to consider:

Behavioral Metrics

Temporal Patterns

Economic Indicators

Network-Based Features
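
As a rough illustration of the four feature families above, the sketch below turns one account's transaction rows into a flat feature dictionary. It assumes each row is a dict with the keys "from", "to", "value", and "timestamp" (Unix seconds); these column names and the function name extract_features are assumptions for illustration, not the dataset's confirmed schema.

```python
import math
from statistics import mean


def extract_features(address, rows):
    """Summarise one account's transactions as a flat feature dict.

    `rows` is a list of dicts with the assumed keys "from", "to",
    "value", and "timestamp" (Unix seconds).
    """
    address = address.lower()
    sent = [r for r in rows if r["from"].lower() == address]
    received = [r for r in rows if r["to"].lower() == address]

    # Temporal patterns: average gap between consecutive transactions.
    times = sorted(int(r["timestamp"]) for r in rows)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]

    # Economic indicators: total value moved in each direction.
    total_in = sum(float(r["value"]) for r in received)
    total_out = sum(float(r["value"]) for r in sent)

    # Network-based features: distinct counterparties on each side.
    senders = {r["from"].lower() for r in received}
    recipients = {r["to"].lower() for r in sent}

    return {
        # Behavioral metrics: how often the account sends versus receives.
        "n_sent": len(sent),
        "n_received": len(received),
        "mean_gap_seconds": mean(gaps) if gaps else 0.0,
        # log1p tames values that span many orders of magnitude.
        "log_total_in": math.log1p(total_in),
        "log_total_out": math.log1p(total_out),
        "unique_senders": len(senders),
        "unique_recipients": len(recipients),
    }
```

An account with no rows yields all-zero features, which matches the zero-based treatment of empty transaction files.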



Modeling Strategies: Machine Learning Approaches

With engineered features in place, several modeling paths can be explored:

Supervised Learning

Use labeled training data to train classifiers such as:

These models handle tabular data well and offer interpretability.

Ensemble Methods

Combine multiple models using:

Ensembles often outperform single models in accuracy and robustness.
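
One common way to combine models is soft voting: average the class probabilities that each model assigns to an account, then pick the class with the highest average. The sketch below is an illustration of that idea only; the function name soft_vote and the dict-per-model input format are assumptions, and soft voting is chosen here as an example rather than as the method this article prescribes.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities from several models and pick the best.

    `probabilities` holds one dict per model, each mapping a class label to
    that model's predicted probability for a single account. `weights`
    optionally gives more say to stronger models.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    combined = {}
    for model_probs, weight in zip(probabilities, weights):
        for label, p in model_probs.items():
            combined[label] = combined.get(label, 0.0) + weight * p / total_weight
    best = max(combined, key=combined.get)
    return best, combined
```

With equal weights, three models that give "cex" probabilities of 0.4, 0.7, and 0.8 produce a combined "cex" probability of about 0.63, so the ensemble predicts "cex" even though the first model preferred "dex".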

Dimensionality Reduction

Apply techniques like:

This helps eliminate noise and speed up training.

Graph Neural Networks (GNNs)

Since blockchain data is inherently relational, GNNs can capture structural patterns:

While powerful, GNNs require careful graph construction and higher computational resources.
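
Graph construction itself can start simply. The sketch below builds a directed, weighted edge map from transaction rows and derives per-node degrees that could serve as initial node features. It assumes rows with "from", "to", and "value" keys; those names and both function names are hypothetical, and a real pipeline would pass the resulting edges to a GNN library of choice.

```python
from collections import defaultdict


def build_transaction_graph(rows):
    """Build a directed, weighted edge map from transaction rows.

    Returns {(sender, recipient): [transfer_count, total_value]}.
    """
    edges = defaultdict(lambda: [0, 0.0])
    for row in rows:
        sender, recipient = row["from"].lower(), row["to"].lower()
        if not sender or not recipient:
            continue  # skip rows that lack one endpoint
        edge = edges[(sender, recipient)]
        edge[0] += 1
        edge[1] += float(row["value"])
    return dict(edges)


def node_degrees(edges):
    """Return {address: (in_degree, out_degree)} counted over distinct edges."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for sender, recipient in edges:
        out_deg[sender] += 1
        in_deg[recipient] += 1
    nodes = set(in_deg) | set(out_deg)
    return {node: (in_deg[node], out_deg[node]) for node in nodes}
```

Collapsing repeated transfers between the same pair into one weighted edge keeps the graph small, at the cost of discarding the order and timing of individual transfers.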


Model Evaluation & Submission Guidelines

The final score is based on prediction accuracy relative to a reference solution, scaled to a maximum of 500 points:

Score = min(500 × (your accuracy / reference accuracy), 500)

For example:
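
The scoring rule can be written as a one-line function. The accuracy figures used in the note that follows are hypothetical and chosen only to show the scaling and the cap; the function name competition_score is likewise an illustration.

```python
def competition_score(accuracy, reference_accuracy, max_points=500):
    """Scale accuracy against the reference solution, capped at max_points."""
    return min(max_points * (accuracy / reference_accuracy), max_points)
```

With a hypothetical reference accuracy of 0.80, a model at 0.72 accuracy earns roughly 450 points (500 × 0.9), while a model at 0.85 is capped at 500.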

Required Submission Files:

  1. prediction.csv: Must follow the format of demonstrated_answer_format.csv, listing predicted labels for each test address.
  2. model_checkpoint.pth (or equivalent): Full model state dict or saved model file, ensuring reproducibility and inference capability.

Ensure your model can be loaded without external dependencies.


Frequently Asked Questions (FAQ)

Q: Can I use both ETH and ERC20 data together?
A: Yes. Combining both datasets often leads to richer feature sets. For example, DEX users frequently engage in ERC20 token swaps.

Q: How should I handle empty transaction files?
A: Empty files mean no historical activity. You can still extract zero-based features (e.g., “0 outgoing transactions”) or infer from metadata.

Q: Are there imbalances in label distribution?
A: Likely yes. Some classes (like phishing) may be underrepresented. Consider using class weighting or oversampling techniques like SMOTE.
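
As a sketch of class weighting, the function below assigns each class a weight of n_samples / (n_classes × class_count), so rarer classes count for more during training. This is the same heuristic scikit-learn uses for class_weight="balanced"; the function name here is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
```

For a toy training set of eight wallet accounts and two phishing accounts, this gives wallet a weight of 0.625 and phishing a weight of 2.5.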

Q: Should I normalize numerical features?
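A: For most models, yes. Features like value or gasUsed span many orders of magnitude, so apply log-scaling or standardization before training linear models or neural networks. Tree-based models are largely insensitive to monotonic rescaling, so the step matters less there.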
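
One way to combine both steps is to take log1p of each value and then rescale to zero mean and unit variance. The sketch below does this for a single feature column; the function name is hypothetical.

```python
import math
from statistics import mean, pstdev


def log_standardize(values):
    """Apply log1p, then rescale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mu, sigma = mean(logged), pstdev(logged)
    if sigma == 0:
        return [0.0 for _ in logged]  # a constant feature carries no signal
    return [(x - mu) / sigma for x in logged]
```

In a real pipeline the mean and standard deviation would be computed on the training set only and then reused for the test set, so that test data does not leak into preprocessing.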

Q: Can I include external data?
A: Unless specified otherwise, stick to the provided dataset to ensure fairness. Using public label databases may violate competition rules.

Q: Is timing important in predictions?
A: Yes. Transaction timestamps allow you to build time-aware features such as session bursts or dormancy periods—critical for distinguishing bots from humans.
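
Burst and dormancy features can be derived from sorted timestamps alone. In the sketch below, burst_gap is a hypothetical threshold in seconds: consecutive transactions no further apart than this are counted as one burst. The function name and the default of 60 seconds are assumptions to tune against the data.

```python
def temporal_features(timestamps, burst_gap=60):
    """Derive dormancy and burst statistics from Unix timestamps (seconds).

    Consecutive transactions at most `burst_gap` seconds apart are grouped
    into the same burst.
    """
    times = sorted(timestamps)
    if len(times) < 2:
        return {"longest_dormancy": 0, "burst_count": 0,
                "largest_burst": len(times)}
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    bursts, current = [], 1
    for gap in gaps:
        if gap <= burst_gap:
            current += 1
        else:
            bursts.append(current)
            current = 1
    bursts.append(current)
    return {
        "longest_dormancy": max(gaps),
        # Only groups of two or more transactions count as bursts.
        "burst_count": sum(1 for size in bursts if size > 1),
        "largest_burst": max(bursts),
    }
```

Many tight bursts separated by regular gaps hint at automated activity, while a single burst followed by long dormancy fits the sudden-spike pattern described for phishing accounts.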


Final Thoughts: Advancing Blockchain Intelligence

Predicting blockchain account roles is more than an academic exercise—it’s a critical step toward securing decentralized systems. As DeFi, NFTs, and Web3 applications grow, so does the need for intelligent tools that can detect fraud, identify services, and map ecosystem dynamics.

By leveraging transaction metadata, behavioral analytics, and modern machine learning techniques, we can move closer to real-time identity inference on public ledgers—without compromising privacy or decentralization.

Whether you're building fraud detection engines or enhancing wallet security, this type of analysis forms the backbone of next-generation blockchain intelligence.
