Unlocking Blockchain Identity: How to Predict Account Roles Through Transaction Analysis

Blockchains provide a decentralized, transparent ledger for recording transactions. Within a blockchain network, large numbers of accounts interact daily by sending and receiving transactions. While many of these interactions represent legitimate services or user activity, others may indicate malicious behavior such as fraud or phishing. Accurately identifying the roles behind blockchain addresses is essential for enhancing security, improving analytics, and mitigating risk in decentralized ecosystems.

This article explores the challenge of predicting blockchain account identities using transaction data—specifically focusing on ETH and ERC20 transactions—and outlines strategies for building effective machine learning models to classify account types such as DEX, CEX, lending, phishing, and more.

Understanding the Challenge: Predicting Blockchain Account Roles

The core task involves analyzing raw transaction data to determine the functional role of unknown blockchain addresses. Given labeled training data, participants must develop models that can generalize patterns and accurately predict labels for unseen accounts in a test dataset.

This problem directly contributes to blockchain security, fraud detection, and on-chain analytics, making it highly relevant for developers, researchers, and organizations operating in Web3.

Data Overview: ETH and ERC20 Transaction Records

The dataset is structured under competition_dataset/training and competition_dataset/testing, with each account represented by two CSV files:

ETH_transaction_lst/<address>.csv: Contains native Ethereum transactions.
ERC20_transaction_lst/<address>.csv: Includes token transfers (e.g., USDT, UNI).

Either file may be empty, which indicates no activity of that type, but every test address has at least one non-empty transaction file.
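
A minimal loading sketch, assuming the two folders sit directly under competition_dataset/training (or competition_dataset/testing) and that non-empty files start with a header row. The helper names load_transactions and load_account are illustrative and not part of the competition kit.

```python
import csv
from pathlib import Path

def load_transactions(root, kind, address):
    """Return one account's CSV rows as a list of dicts ([] if missing or empty)."""
    path = Path(root) / kind / f"{address}.csv"
    # A zero-byte or absent file is treated as "no activity of this type".
    if not path.exists() or path.stat().st_size == 0:
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

def load_account(root, address):
    """Load both transaction types for one address (folder layout is assumed)."""
    return {
        "eth": load_transactions(root, "ETH_transaction_lst", address),
        "erc20": load_transactions(root, "ERC20_transaction_lst", address),
    }
```

All values come back as strings, so numeric fields such as value or timeStamp still need explicit conversion before feature extraction.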

Key Features in ETH Transactions

from/to: Sender and receiver addresses.
value: Transaction amount in wei (1 ETH = 10¹⁸ wei).
timeStamp: Unix timestamp indicating when the transaction occurred.
isError / txreceipt_status: Flags for failed transactions.
gasUsed / gasPrice: Indicators of transaction cost and network demand.
functionName / methodId: Reveals smart contract interactions (e.g., swaps, deposits).

Key Features in ERC20 Transactions

contractAddress: Identifies the token involved (e.g., USDC contract).
tokenSymbol / tokenName: Human-readable token identifier (e.g., DAI).
value / tokenDecimal: Raw value and decimal scaling factor (e.g., divide by 10⁶ for USDT).
from/to: Same as ETH but specific to token movement.

These fields form the foundation for feature engineering, enabling insights into behavioral patterns across different account types.
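
Before any aggregation, raw amounts need to be brought onto a common scale. The sketch below shows the two conversions described above (wei to ETH, and raw token value to token units via tokenDecimal). It assumes the CSV stores amounts as integer strings; the function names are illustrative.

```python
from decimal import Decimal

WEI_PER_ETH = Decimal(10) ** 18  # 1 ETH = 10^18 wei

def wei_to_eth(value_wei):
    """Convert a raw ETH `value` field (wei) to ETH."""
    return Decimal(str(value_wei)) / WEI_PER_ETH

def token_amount(raw_value, token_decimal):
    """Scale a raw ERC20 `value` by its `tokenDecimal` field."""
    return Decimal(str(raw_value)) / (Decimal(10) ** int(token_decimal))

print(wei_to_eth("1500000000000000000"))  # 1.5 ETH
print(token_amount("2500000", "6"))       # 2.5 units of a 6-decimal token
```

Decimal avoids the rounding that floats introduce on 18-digit integers, although its default context still rounds to 28 significant digits. Converting to float afterwards is usually fine for model features.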

Label Categories: What Are We Classifying?

The training set includes labeled accounts under seven primary categories:

dex – Decentralized exchanges (e.g., Uniswap), characterized by frequent swaps and liquidity provision.
cex – Centralized exchanges (e.g., Binance), often showing high-volume inflows/outflows and centralized deposit addresses.
lending – Protocols like Aave or Compound; marked by collateral deposits, borrow events, and interest accruals.
gambling – On-chain betting platforms with repetitive small-value transactions and random payout patterns.
phishing – Malicious accounts tricking users into sending funds; often show sudden spikes in incoming transfers from diverse sources.
wallet – Personal wallets used for storage and occasional transfers; typically low-frequency activity.
payments – Merchant or service accounts receiving regular micropayments or subscription fees.

Understanding these behaviors allows us to extract meaningful features that distinguish one class from another.

Feature Engineering: From Raw Data to Predictive Signals

Effective classification starts with intelligent feature extraction. Here are some high-impact features to consider:

Behavioral Metrics

Total number of incoming/outgoing transactions
Average, median, and variance of transaction values
Frequency of transactions per hour/day
Ratio of successful vs failed transactions
Number of unique counterparties interacted with
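
The metrics above can be sketched for a single account as follows. This assumes rows are dicts of strings as read from the ETH CSV, that value is an integer number of wei, and that isError equals "1" for a failed transaction (an assumption about the export convention, worth checking against the data). The function name behavioral_features is illustrative.

```python
from statistics import mean, median, pvariance

def behavioral_features(rows, address):
    """Per-account counts and value statistics from ETH transaction rows."""
    address = address.lower()
    incoming = [r for r in rows if r["to"].lower() == address]
    outgoing = [r for r in rows if r["from"].lower() == address]
    values = [int(r["value"]) / 1e18 for r in rows]  # wei -> ETH
    failed = sum(1 for r in rows if r.get("isError") == "1")  # assumed flag value
    counterparties = {r["from"].lower() for r in incoming}
    counterparties |= {r["to"].lower() for r in outgoing}
    counterparties.discard(address)  # ignore transfers to self
    return {
        "n_in": len(incoming),
        "n_out": len(outgoing),
        "value_mean": mean(values) if values else 0.0,
        "value_median": median(values) if values else 0.0,
        "value_var": pvariance(values) if values else 0.0,
        "failed_ratio": failed / len(rows) if rows else 0.0,
        "n_counterparties": len(counterparties),
    }
```

Returning zeros for an empty row list keeps the feature vector the same length for accounts whose ETH file is empty.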

Temporal Patterns

Time between first and last transaction (account lifespan)
Peak activity hours (e.g., activity concentrated in a few hours of the day may suggest human-driven use, while uniform round-the-clock activity may suggest automation)
Burstiness index: measures irregularity in transaction timing
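
The article does not fix a formula for burstiness. One common definition, due to Goh and Barabási, is B = (σ − μ) / (σ + μ), where μ and σ are the mean and standard deviation of the gaps between consecutive transactions. The sketch below uses that definition together with account lifespan; the function name temporal_features is illustrative.

```python
from statistics import mean, pstdev

def temporal_features(timestamps):
    """Lifespan in seconds and burstiness of inter-transaction gaps."""
    ts = sorted(int(t) for t in timestamps)  # timeStamp values are Unix seconds
    if len(ts) < 2:
        return {"lifespan_s": 0, "burstiness": 0.0}
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    mu, sigma = mean(gaps), pstdev(gaps)
    # B = (sigma - mu) / (sigma + mu); defined as 0 when all gaps are zero
    burstiness = (sigma - mu) / (sigma + mu) if (sigma + mu) > 0 else 0.0
    return {"lifespan_s": ts[-1] - ts[0], "burstiness": burstiness}
```

With this definition, perfectly regular activity gives −1, Poisson-like activity gives values near 0, and long silences broken by dense bursts push the value toward 1.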

Economic Indicators

Total ETH sent/received (converted from wei)
Total gas consumed across all transactions
Proportion of interactions with known contract addresses

Network-Based Features

In-degree and out-degree (number of senders/receivers)
Clustering coefficient if constructing a subgraph
Presence of loops (sending back to self or circular transfers)
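
Degree and loop features can be computed from one account's rows without building a full graph. The sketch below is one possible reading of the list above: it counts distinct senders and receivers, self-transfers, and counterparties that appear on both sides (a simple proxy for circular transfers). The function name and the two_way_counterparties feature are illustrative choices.

```python
from collections import Counter

def network_features(rows, address):
    """Ego-network features for one address from its transaction rows."""
    address = address.lower()
    senders, receivers = Counter(), Counter()
    self_loops = 0
    for r in rows:
        src, dst = r["from"].lower(), r["to"].lower()
        if src == address and dst == address:
            self_loops += 1          # account sent funds to itself
        elif dst == address:
            senders[src] += 1        # incoming edge
        elif src == address:
            receivers[dst] += 1      # outgoing edge
    return {
        "in_degree": len(senders),
        "out_degree": len(receivers),
        "self_loops": self_loops,
        # counterparties that both paid and were paid by this account
        "two_way_counterparties": len(set(senders) & set(receivers)),
    }
```

A clustering coefficient needs edges between the counterparties themselves, which a single account's file does not contain; it would require joining rows across many accounts.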

Modeling Strategies: Machine Learning Approaches

With engineered features in place, several modeling paths can be explored:

Supervised Learning

Use labeled training data to train classifiers such as:

Random Forest
XGBoost or LightGBM
Logistic Regression (baseline)

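These models handle tabular data well and offer some interpretability: tree ensembles expose feature importances, and logistic regression exposes per-feature coefficients.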
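
A baseline of this kind might look like the sketch below, which assumes scikit-learn and NumPy are installed. The feature matrix and labels are synthetic stand-ins generated on the spot, not competition data; in practice X would hold one row of engineered features per address and y the training labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic placeholder data: 200 "accounts" with 5 numeric features each.
X = rng.normal(size=(200, 5))
# Two illustrative labels, deliberately imbalanced (roughly 1 in 4 is "phishing").
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.8, "phishing", "wallet")

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)

# 5-fold cross-validated accuracy, the same metric the competition scores on.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Cross-validation on the training set gives a rough estimate of test accuracy before a submission is made; the hyperparameters shown are starting points, not tuned values.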

Ensemble Methods

Combine multiple models using:

Stacking: a meta-model learns from the predictions of the base models
Boosting: each new learner corrects the errors of the previous ones (e.g., AdaBoost)
Bagging: reduces variance via bootstrap aggregation

Ensembles often outperform single models in accuracy and robustness.

Dimensionality Reduction

Apply techniques like:

PCA (Principal Component Analysis)
t-SNE or UMAP for visualization
Feature selection via mutual information or SHAP values

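PCA and feature selection can remove noisy or redundant features and speed up training. t-SNE and UMAP are best kept for visually inspecting how well classes separate rather than for producing model inputs.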

Graph Neural Networks (GNNs)

Since blockchain data is inherently relational, GNNs can capture structural patterns:

Treat accounts as nodes, transactions as edges
Use node features (transaction stats) + graph structure
Models like GCN, GAT, or GraphSAGE can detect complex dependencies

While powerful, GNNs require careful graph construction and higher computational resources.

Model Evaluation & Submission Guidelines

The final score is based on prediction accuracy relative to a reference solution, scaled to a maximum of 500 points:

Score = min(500 × (your accuracy / reference accuracy), 500)

For example:

Your accuracy: 75%
Reference accuracy: 85%
Final score: 500 × 0.75 / 0.85 ≈ 441 points
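
The formula can be checked in a few lines of Python, including the capped case where a submission beats the reference. The function name competition_score is illustrative.

```python
def competition_score(accuracy, reference_accuracy, max_points=500):
    """Score = min(max_points * accuracy / reference_accuracy, max_points)."""
    if reference_accuracy <= 0:
        raise ValueError("reference_accuracy must be positive")
    return min(max_points * accuracy / reference_accuracy, max_points)

print(round(competition_score(0.75, 0.85)))  # below the reference: about 441
print(competition_score(0.90, 0.85))         # above the reference: capped at 500
```

Because of the cap, accuracy gained beyond the reference solution earns no additional points.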

Required Submission Files:

prediction.csv: Must follow the format of demonstrated_answer_format.csv, listing predicted labels for each test address.
model_checkpoint.pth (or equivalent): Full model state dict or saved model file, ensuring reproducibility and inference capability.

Ensure your model can be loaded without external dependencies.

Frequently Asked Questions (FAQ)

Q: Can I use both ETH and ERC20 data together?
A: Yes. Combining both datasets often leads to richer feature sets. For example, DEX users frequently engage in ERC20 token swaps.

Q: How should I handle empty transaction files?
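A: An empty file means no activity of that type. Encode it as zero-valued features (e.g., “0 outgoing transactions”), optionally with a binary flag marking the file as empty, and rely on the account's other transaction file for signal; every test address has at least one non-empty file.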

Q: Are there imbalances in label distribution?
A: Likely yes. Some classes (like phishing) may be underrepresented. Consider using class weighting or oversampling techniques like SMOTE.

Q: Should I normalize numerical features?
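A: For linear models and neural networks, yes. Features like value or gasUsed span many orders of magnitude, so apply log-scaling or standardization before modeling. Tree-based models such as Random Forest or XGBoost are largely insensitive to monotonic rescaling, so the step matters less for them.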
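
As a small illustration of log-scaling, log(1 + x) compresses heavy-tailed, non-negative amounts while mapping zero to zero. The sample numbers below are made up for the example and the function name is illustrative.

```python
import math

def log_scale(values):
    """Compress non-negative, heavy-tailed values with log(1 + x)."""
    return [math.log1p(float(v)) for v in values]

raw_values_eth = [0.0, 0.01, 1.0, 250.0, 12000.0]  # illustrative numbers only
print([round(v, 3) for v in log_scale(raw_values_eth)])
```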

Q: Can I include external data?
A: Unless specified otherwise, stick to the provided dataset to ensure fairness. Using public label databases may violate competition rules.

Q: Is timing important in predictions?
A: Yes. Transaction timestamps allow you to build time-aware features such as session bursts or dormancy periods—critical for distinguishing bots from humans.

Final Thoughts: Advancing Blockchain Intelligence

Predicting blockchain account roles is more than an academic exercise—it’s a critical step toward securing decentralized systems. As DeFi, NFTs, and Web3 applications grow, so does the need for intelligent tools that can detect fraud, identify services, and map ecosystem dynamics.

By leveraging transaction metadata, behavioral analytics, and modern machine learning techniques, we can move closer to real-time identity inference on public ledgers—without compromising privacy or decentralization.

Whether you're building fraud detection engines or enhancing wallet security, this type of analysis forms the backbone of next-generation blockchain intelligence.
