PDB 3Di Chains Dataset

This repository contains chain-level sequences and 3Di tokens derived from RCSB PDB structures, with a polymer class label for each chain (prot, DNA, RNA, or other). The files were merged from chunk folders of 1,000 entries each and cleaned so that every CSV shares the same schema. The current data is based on only about 120K proteins.

Files

3di_chains_chaintag_filtered_prot_only.csv (main file; use this one)

3di_chains_chaintag_all.csv (474 MB)

3di_chains_chaintag_prot_only.csv (464 MB)

3di_chains_chaintag_sample_1000.csv

  • _filtered_prot_only has repeated protein chains and DNA/RNA chains filtered out. This is the main file used for SFT.
  • _all contains all chains (including D-amino acid, RNA, DNA, and any other chains).
  • _prot_only contains protein chains only (L- and D-amino acids treated as protein).
  • _sample_1000 is a 1,000-row random sample for quick inspection.
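
As a rough illustration of how the _all and _prot_only variants relate, the sketch below keeps only rows whose polymer_class is prot. It uses Python's standard csv module; the helper name protein_only and the demo rows are invented for illustration, and this is not necessarily how the published files were produced. The extra de-duplication applied to the _filtered_prot_only file is not reproduced here, since its exact criterion is not stated.

```python
import csv
import io


def protein_only(rows):
    """Yield only the rows labelled as protein chains (hypothetical helper)."""
    for row in rows:
        if row["polymer_class"] == "prot":
            yield row


# Made-up stand-in for a few rows of 3di_chains_chaintag_all.csv.
# Only the columns needed for the filter are included; all values are invented.
demo_csv = io.StringIO(
    "pdb_id,chain_id,polymer_class\n"
    "1ABC,A,prot\n"
    "1ABC,B,DNA\n"
    "2XYZ,A,RNA\n"
    "2XYZ,B,prot\n"
)

kept = list(protein_only(csv.DictReader(demo_csv)))
print([(row["pdb_id"], row["chain_id"]) for row in kept])
```

To run this against a real file, replace demo_csv with a handle from open(path, newline="").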

All raw PDB files are in https://drive.google.com/drive/folders/1jdz5c_EoNCpqXXmDdklr1tlZtzN8b4jY?usp=sharing

Schema

All CSVs use the same columns:

  • index: global row index from the source CSV (0-based)
  • pdb_id: 4-character PDB code (e.g., 9B4J)
  • chain_id: chain identifier (alphanumeric, may include digits/letters)
  • aa_seq: amino-acid sequence (when available)
  • threeDi_seq: Foldseek 3Di token sequence (used for SFT)
  • combined_seq: helper concatenation (3Di/AA) used upstream
  • seq_len: chain sequence length (prefer AA length; else derived)
  • chunk: source folder name (e.g., 1000_1999)
  • path: absolute path to the structure file used
  • polymer_class: one of prot, DNA, RNA, other
    (D-amino peptides are classified as prot)
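
A minimal sketch of reading rows with this schema, using only the Python standard library. The ChainRecord class and read_chains function are hypothetical names introduced here, and the demo rows (including the combined_seq and path values) are invented. The sketch assumes index and seq_len always parse as integers and treats an empty aa_seq as missing.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO

# Column names as listed in the schema above.
COLUMNS = [
    "index", "pdb_id", "chain_id", "aa_seq", "threeDi_seq",
    "combined_seq", "seq_len", "chunk", "path", "polymer_class",
]


@dataclass
class ChainRecord:
    """Typed view of the columns most relevant for training (hypothetical)."""
    index: int
    pdb_id: str
    chain_id: str
    aa_seq: Optional[str]  # None when the CSV field is empty
    threeDi_seq: str
    seq_len: int
    polymer_class: str


def read_chains(handle: TextIO) -> Iterator[ChainRecord]:
    """Check the header against COLUMNS, then yield one record per row."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield ChainRecord(
            index=int(row["index"]),
            pdb_id=row["pdb_id"],
            chain_id=row["chain_id"],
            aa_seq=row["aa_seq"] or None,
            threeDi_seq=row["threeDi_seq"],
            seq_len=int(row["seq_len"]),
            polymer_class=row["polymer_class"],
        )


# Invented demo rows; real sequences, paths, and combined_seq values will differ.
demo_csv = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "0,9B4J,A,MKT,DVV,placeholder,3,9000_9999,/data/9B4J.pdb,prot\n"
    "1,9B4J,B,,DDD,placeholder,3,9000_9999,/data/9B4J.pdb,DNA\n"
)
records = list(read_chains(demo_csv))
print(records[0])
```

Validating the header up front is a design choice that makes a mismatched file fail immediately rather than partway through a training run.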

Basic Stats (from _all)

=== BASIC COUNTS ===
Proteins (unique pdb_id) : 111629
Chains (rows)            : 418335
Avg chains per protein   : 3.75

=== CHAIN-LEVEL COMPOSITION ===
prot  : 391317 (93.54%)
DNA   : 18575 (4.44%)
RNA   : 8375 (2.00%)
other : 68 (0.02%)

=== PROTEIN-LEVEL PRESENCE (non-exclusive) ===
Proteins with any prot : 109267 (97.88%)
Proteins with any DNA  : 7334 (6.57%)
Proteins with any RNA  : 4622 (4.14%)

=== CHAIN LENGTH (residues) ===
mean: 254.88 | Q25: 103 | Q50: 208 | Q75: 335

=== PROTEIN TOTAL LENGTH (sum of chains) ===
mean: 955.19 | Q25: 270 | Q50: 516 | Q75: 1089
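
Statistics of this kind can be recomputed from the CSV rows. The sketch below is one way to do so with the Python standard library; basic_stats is a hypothetical helper, the demo rows are invented, and the quartile method (linear interpolation via method="inclusive") is an assumption, so results on the real file may differ slightly from the figures above if a different method was used.

```python
from collections import Counter, defaultdict
from statistics import mean, quantiles


def basic_stats(rows):
    """Summarise chain rows given as dicts with pdb_id, seq_len, polymer_class.

    Assumes at least two rows, since quartiles need two or more data points.
    """
    rows = list(rows)
    lengths = [int(row["seq_len"]) for row in rows]

    # Total residues per PDB entry, summed over all of its chains.
    total_per_entry = defaultdict(int)
    for row, length in zip(rows, lengths):
        total_per_entry[row["pdb_id"]] += length

    class_counts = Counter(row["polymer_class"] for row in rows)
    return {
        "entries": len(total_per_entry),
        "chains": len(rows),
        "avg_chains_per_entry": len(rows) / len(total_per_entry),
        "class_share_pct": {
            name: 100 * count / len(rows) for name, count in class_counts.items()
        },
        "chain_len_mean": mean(lengths),
        "chain_len_quartiles": quantiles(lengths, n=4, method="inclusive"),
        "entry_total_len_mean": mean(list(total_per_entry.values())),
    }


# Invented demo rows; in practice these would come from csv.DictReader.
demo_rows = [
    {"pdb_id": "1ABC", "seq_len": "100", "polymer_class": "prot"},
    {"pdb_id": "1ABC", "seq_len": "200", "polymer_class": "prot"},
    {"pdb_id": "2XYZ", "seq_len": "300", "polymer_class": "DNA"},
    {"pdb_id": "2XYZ", "seq_len": "400", "polymer_class": "prot"},
]
stats = basic_stats(demo_rows)
for key, value in stats.items():
    print(f"{key}: {value}")
```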
