Published on March 20, 2025

Publication Classifier

Classifies research papers and predicts publishability using SciBERT, Sentence-BERT, and self-training pipelines

[Screenshot: Publication Classifier with output predictions and metrics]

About Publication Classifier

Publication Classifier is a hybrid NLP system that analyzes research papers to predict whether they are publishable and to recommend the most suitable venue. It combines transformer models (SciBERT and Sentence-BERT) with classic machine learning and a self-training strategy to produce interpretable predictions.

The classifier uses text embeddings from SciBERT to predict whether a paper is publishable. If it is, a semantic similarity step using Sentence-BERT matches the paper with top-tier venues such as CVPR, NeurIPS, EMNLP, KDD, and TMLR.
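
The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names, the 0.5 threshold, and the three-dimensional toy vectors are all assumptions, standing in for a real classifier probability and real Sentence-BERT embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def recommend_venue(publishable_prob, paper_vec, prototypes, threshold=0.5):
    """Stage 1: gate on publishability. Stage 2: pick the closest prototype.

    Returns None when the paper falls below the (assumed) threshold.
    """
    if publishable_prob < threshold:
        return None
    return max(
        prototypes,
        key=lambda name: cosine_similarity(paper_vec, prototypes[name]),
    )

# Hypothetical prototype vectors; real ones would be Sentence-BERT
# embeddings of text describing each venue's topics.
prototypes = {
    "CVPR": [0.9, 0.1, 0.0],
    "EMNLP": [0.1, 0.9, 0.1],
    "KDD": [0.0, 0.2, 0.9],
}

paper_vec = [0.2, 0.8, 0.1]  # toy stand-in for a paper embedding
venue = recommend_venue(0.83, paper_vec, prototypes)
rejected = recommend_venue(0.30, paper_vec, prototypes)
print(venue, rejected)  # EMNLP None
```

Gating before matching means papers judged unpublishable never receive a venue recommendation, which mirrors the order of the two steps in the description.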

  • SciBERT for Content Understanding
    Generates deep contextual embeddings from full paper content for classification tasks.

  • Self-Training Publishability Classifier
    Trains on a small set of labeled papers, then iteratively pseudo-labels a larger unlabeled set and retrains on it.

  • Sentence-BERT for Conference Matching
    Identifies the most semantically relevant conference based on similarity with topic prototypes.

  • Plug-and-Play with CSV Input
    Accepts labeled and unlabeled CSV files and writes results to results/output.csv, ready for evaluation or submission.

  • Expandable & Modular Codebase
    Easy to fine-tune, extend with new conferences, or adapt to other models such as Longformer or GPT.
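
The self-training step listed above follows a general pattern: fit on the labeled papers, pseudo-label the unlabeled papers the model is most confident about, add them to the training set, and repeat. The sketch below illustrates that loop with a deliberately simple nearest-centroid classifier on toy 2-D vectors. The classifier, the `min_margin` confidence rule, and all names are assumptions made for illustration; the project itself would run over SciBERT embeddings with whatever base model the authors chose.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def fit_centroids(X, y):
    """Return {label: centroid} for the labeled examples."""
    by_label = {}
    for x, label in zip(X, y):
        by_label.setdefault(label, []).append(x)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict_with_margin(centroids, x):
    """Predict the nearest centroid's label; margin = distance gap to the runner-up."""
    ranked = sorted((math.dist(x, c), label) for label, c in centroids.items())
    best_dist, best_label = ranked[0]
    margin = ranked[1][0] - best_dist if len(ranked) > 1 else float("inf")
    return best_label, margin

def self_train(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=10):
    """Repeatedly move confidently pseudo-labeled points into the training set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)  # refit on everything labeled so far
        still_unlabeled = []
        added = 0
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                still_unlabeled.append(x)
        pool = still_unlabeled
        if added == 0:
            break  # no confident predictions left
    return X_lab, y_lab, pool

# Toy data: 1 = publishable, 0 = not publishable (labels are illustrative).
X_lab = [[0.0, 0.0], [10.0, 0.0]]
y_lab = [0, 1]
X_unlab = [[3.0, 0.0], [9.0, 0.0], [4.6, 0.0]]

X_all, y_all, leftover = self_train(X_lab, y_lab, X_unlab)
print(y_all, leftover)
```

In this toy run the point at 4.6 is too ambiguous in the first round and only receives a pseudo-label after the class centroids shift in the second. The same mechanism can reinforce early mistakes, so the confidence threshold deserves care. scikit-learn provides a packaged version of this idea as `sklearn.semi_supervised.SelfTrainingClassifier`.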


Tech Stack

  • Embeddings & Transformers: SciBERT, Sentence-BERT, Hugging Face Transformers
  • Modeling: scikit-learn, self-training classifier
  • Data Handling: pandas, numpy
  • Evaluation: precision, recall, F1, confusion matrix
  • Execution: Python 3.8+, CLI-compatible scripts
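
The evaluation metrics listed in the stack above all derive from the four cells of a binary confusion matrix. The sketch below shows how, using made-up labels (1 = publishable, 0 = not publishable); the function names are hypothetical and the numbers are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall, and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Made-up ground truth and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)         # 3 1 1 3
print(precision, recall, f1)  # 0.75 0.75 0.75
```

In practice, scikit-learn's `confusion_matrix` and `precision_recall_fscore_support` compute the same quantities.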

Credits

  • SciBERT: For domain-specific contextual embeddings of scientific content.
  • Sentence-BERT: For high-quality semantic similarity computation.
  • scikit-learn: For baseline classification pipelines and metrics.
  • pandas/numpy: For efficient CSV handling and data preprocessing.

Author

Developed by Ayush Sharma & Rishabh Kothari.
Check out the full project here.
Check out more projects on GitHub or reach out via LinkedIn.