Ensemble Learning of Text Detection for Large Language Models
Based on Integrated Language Features

Luyan Zhang
XTech AI Lab
Keywords: Large Language Models, Text Detection, Ensemble Learning, BERT, Language Features

1 Abstract

The rapid development of large language models has greatly facilitated daily life and work, but it has also posed challenges for individuals and society. Therefore, there is an urgent need for tools capable of detecting text generated by large language models. To balance detection performance and generalization ability, this paper proposes a text detection method based on ensemble learning of linguistic features, called EBF Detection. This method integrates fine-tuned pretrained language models with advanced natural language statistical features and employs a decision mechanism to detect text generated by large language models. Experimental results demonstrate that EBF Detection achieved an average detection accuracy of 98.72% on in-domain data and 96.79% on out-of-domain data.

2 Methodology

System Architecture

The pipeline proceeds in four stages: (1) the input text to be classified is passed to a feature-extraction stage that produces BERT representations and statistical language features; (2) these features feed six specialized weak detectors; (3) the detectors' individual predictions are combined by majority voting; (4) the ensemble outputs the final classification of the text as human-written or AI-generated.
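
The detector interface and function names below are hypothetical, intended only as a minimal sketch of how six weak detectors could be orchestrated behind this pipeline; the actual implementation details of EBF Detection are not specified here.

```python
from typing import List, Protocol


class WeakDetector(Protocol):
    """Interface assumed for each of the six weak detectors."""

    def predict(self, text: str) -> int:
        """Return 0 for human-written text, 1 for AI-generated text."""
        ...


def ebf_predict(text: str, detectors: List[WeakDetector]) -> int:
    """Collect one vote per weak detector and apply a strict-majority rule."""
    votes = [detector.predict(text) for detector in detectors]
    # A strict majority of "AI" votes flags the text as AI-generated; how a
    # 3-3 tie is resolved is an assumption here (it falls back to "human").
    return 1 if sum(votes) * 2 > len(votes) else 0
```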

BERT-based Detectors

Leverages the semantic understanding capabilities of pre-trained language models (a sketch follows the list):

  • BERT Detector: Deep semantic feature extraction
  • BERTTextCNN: Combines BERT with CNN for pattern recognition
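
The layer sizes, kernel widths, and backbone used by the BERTTextCNN detector are not given here; the following is a minimal sketch of the general BERT-plus-TextCNN pattern, assuming a bert-base-uncased backbone and convolution kernels of width 2, 3, and 4 (PyTorch and Hugging Face Transformers).

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class BertTextCNN(nn.Module):
    """Sketch of a BERTTextCNN-style detector: BERT token embeddings feed
    parallel 1-D convolutions, which are max-pooled and classified."""

    def __init__(self, backbone="bert-base-uncased", kernel_sizes=(2, 3, 4), channels=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained(backbone)
        hidden = self.bert.config.hidden_size
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, channels, k) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(channels * len(kernel_sizes), 2)  # human vs. AI

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) -> (batch, hidden, seq_len) for Conv1d
        hidden_states = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        x = hidden_states.transpose(1, 2)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))  # classification logits


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertTextCNN()
batch = tokenizer(["Sample text to classify."], return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```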

Multi-Feature Detection

Integrates multiple linguistic features (a sketch follows the list):

  • Token Probability: GPT-2 log-likelihood analysis
  • Syntactic Structure: Dependency parsing and tree analysis
  • Lexical Diversity: Type-token ratio and semantic richness
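
As a rough illustration of the latter two features, the sketch below computes a type-token ratio and a mean dependency-tree depth with spaCy (the en_core_web_sm pipeline is an assumption); the token-probability feature can reuse the GPT-2 log-likelihood computation sketched under the statistical detectors below.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English pipeline is installed


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique word forms divided by total words."""
    words = [t.text.lower() for t in nlp(text) if t.is_alpha]
    return len(set(words)) / len(words) if words else 0.0


def dependency_depth(sent) -> int:
    """Depth of one sentence's dependency tree, measured from its root."""
    def depth(token):
        return 1 + max((depth(child) for child in token.children), default=0)
    return depth(sent.root)


def syntactic_feature(text: str) -> float:
    """Mean dependency-tree depth across the sentences of a text."""
    depths = [dependency_depth(sent) for sent in nlp(text).sents]
    return sum(depths) / len(depths) if depths else 0.0
```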

Statistical Detectors

Probabilistic analysis methods (a sketch follows the list):

  • Log-Probability: Token likelihood patterns
  • Log-Rank: Probability rank distributions
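
A minimal sketch of both statistics, scoring a text with GPT-2 via Hugging Face Transformers; the decision thresholds at the end are placeholders, not the values used by EBF Detection.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


@torch.no_grad()
def log_prob_and_log_rank(text: str):
    """Mean token log-probability and mean token log-rank under GPT-2."""
    ids = tokenizer(text, return_tensors="pt", truncation=True)["input_ids"]
    logits = model(ids).logits[:, :-1]          # predictions for tokens 2..n
    targets = ids[:, 1:]                        # the tokens actually observed
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Rank of each observed token among all vocabulary items (1 = most likely).
    ranks = (log_probs > token_log_probs.unsqueeze(-1)).sum(dim=-1) + 1
    return token_log_probs.mean().item(), torch.log(ranks.float()).mean().item()


# LLM-generated text tends to show higher mean log-probability and lower
# log-rank than human text; the cutoffs below are illustrative placeholders.
mean_lp, mean_lr = log_prob_and_log_rank("Text to analyse.")
is_ai_by_logprob = mean_lp > -3.0
is_ai_by_logrank = mean_lr < 1.5
```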

3 Experimental Results

Performance Metrics

  • In-domain accuracy: 98.55%
  • Out-of-domain accuracy: 96.79%
  • Parameter reduction: 98.7%
  • Weak detectors in the ensemble: 6

Figure 1: Detection accuracy comparison showing EBF Detection outperforming other methods with 98.55% accuracy


Figure 2: Multi-Feature Detection Components showing the integration of token probability, syntactic structure, and lexical diversity features


Figure 3: Radar chart comparing in-domain and out-of-domain performance across different detection methods


Figure 4: Overall architecture of EBF Detection showing the voting mechanism combining six detectors

Method          | In-Domain Accuracy (%) | Out-of-Domain Accuracy (%) | F1 Score (%)
Log-Probability | 90.19                  | 89.77                      | 90.42
Log-Rank        | 89.61                  | 90.40                      | 90.32
GLTR            | 90.17                  | 88.10                      | 90.14
DetectGPT       | 79.26                  | 87.52                      | 79.69
BERT            | 90.58                  | 85.66                      | 90.60
EBF Detection   | 98.55                  | 96.06                      | 98.47

4 Key Components

BERT-based Detectors

The BERT Detector and BERTTextCNN leverage pre-trained models' semantic understanding capabilities, which are effective in capturing deep contextual differences between human and AI-generated text. These models excel at understanding semantic coherence and contextual relationships.
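
How the pretrained backbone is fine-tuned is not detailed here; the snippet below is a generic sketch of one fine-tuning step for a binary human-vs-AI classifier, with placeholder texts and hyperparameters.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Placeholder training pair; a real run iterates over a labelled corpus.
texts = ["A human-written paragraph ...", "An LLM-generated paragraph ..."]
labels = torch.tensor([0, 1])  # 0 = human, 1 = AI-generated

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```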

Statistical Feature Detectors

Multi-Feature Detection focuses on shallow linguistic statistics, including token probability (GPT-2 log-likelihood), syntactic structure (dependency-tree depth), and lexical diversity (type-token ratio); a sketch of how such features could feed a simple classifier follows.
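
One plausible way to turn these shallow statistics into a single weak detector is to stack them into a feature vector and fit a lightweight classifier; the logistic-regression choice and the numbers below are illustrative placeholders, not part of EBF Detection's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is [mean GPT-2 log-probability, mean dependency depth, type-token
# ratio] for one training text; the values are illustrative placeholders.
X_train = np.array([
    [-4.1, 5.8, 0.62],  # human-written example
    [-2.7, 4.3, 0.48],  # LLM-generated example
])
y_train = [0, 1]        # 0 = human, 1 = AI-generated

clf = LogisticRegression().fit(X_train, y_train)

# At inference time the same three statistics are computed for the input text
# (e.g. with the helpers sketched in Section 2) and passed to the classifier.
x_new = np.array([[-3.0, 4.9, 0.51]])
print(clf.predict(x_new))
```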

Ensemble Integration

The integration process combines outputs using majority voting, leveraging the complementary strengths of deep semantic features (from BERT-based models) and shallow statistical features (from multi-feature and log-based detectors). This ensures comprehensive coverage of both semantic and stylistic differences.
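
The voting step itself is simple to illustrate; in the sketch below a strict majority of "AI" votes flags the text as AI-generated, and the handling of a 3-3 tie (falling back to "human") is an assumption rather than a detail stated here.

```python
from collections import Counter

# Illustrative votes from the six weak detectors for one input text
# (0 = human-written, 1 = AI-generated); real outputs depend on the
# fitted detectors.
votes = [1, 1, 0, 1, 1, 0]

counts = Counter(votes)
label = 1 if counts[1] * 2 > len(votes) else 0  # strict majority flags "AI"
print(label)  # -> 1: four of the six detectors flag the text as AI-generated
```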

5 Conclusion

This paper introduces EBF Detection, a robust method for detecting text generated by large language models. By integrating multiple weak detectors through ensemble learning, the approach achieves:

Superior Performance

Accuracy of 98.72% on in-domain data, substantially outperforming the compared baseline methods

Excellent Generalization

Maintains 96.79% accuracy on out-of-domain data, demonstrating robustness to domain shifts

Practical Implementation

Does not require watermarking or modification of generated text, preserving naturalness

Future Work

Future research will focus on: