PDB 3Di Chains Dataset
This repository contains chain-level sequences and 3Di tokens derived from RCSB PDB structures, with per-chain polymer class labels (prot, DNA, RNA, other). The files were merged from per-1k chunk folders and cleaned so that every CSV shares the same schema.
The current data covers only about 120K proteins.
Files
3di_chains_chaintag_filtered_prot_only.csv (main file; use this one)
3di_chains_chaintag_all.csv (474 MB)
3di_chains_chaintag_prot_only.csv (464 MB)
3di_chains_chaintag_sample_1000.csv
3di_chains_chaintag_filtered_prot_only.csv: repeated protein chains and DNA/RNA chains filtered out. This is the main file used for SFT.
_all: contains all chains (including D-amino, RNA, DNA, and any others).
_prot_only: contains protein chains only (L- and D-amino acids treated as protein).
_sample_1000: a 1,000-row random sample for quick inspection.
All raw PDB files are available at https://drive.google.com/drive/folders/1jdz5c_EoNCpqXXmDdklr1tlZtzN8b4jY?usp=sharing
Schema
All CSVs use the same columns:
index: global row index from the source CSV (0-based)
pdb_id: 4-character PDB code (e.g., 9B4J)
chain_id: chain identifier (alphanumeric; may include digits and letters)
aa_seq: amino-acid sequence (when available)
threeDi_seq: Foldseek 3Di token sequence (used for SFT)
combined_seq: helper concatenation (3Di/AA) used upstream
seq_len: chain sequence length (AA length preferred; otherwise derived)
chunk: source folder name (e.g., 1000_1999)
path: absolute path to the structure file used
polymer_class: one of prot, DNA, RNA, other
(D-amino peptides are classified as prot.)
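As a rough illustration of the schema above, the sketch below streams rows with Python's standard csv module and keeps a single polymer class. The helper name iter_chains and the in-memory demo rows are hypothetical and not part of this repository. It also assumes that seq_len is stored as a plain number. To use it on real data, pass an open handle to one of the CSV files instead of the demo string.

```python
import csv
import io


def iter_chains(handle, polymer_class=None):
    """Yield one dict per chain row, optionally keeping one polymer class.

    `handle` is any open text file-like object whose header matches the
    schema described above (hypothetical helper, not shipped with the data).
    """
    for row in csv.DictReader(handle):
        if polymer_class is not None and row["polymer_class"] != polymer_class:
            continue
        # csv returns every field as a string; assume seq_len is numeric.
        row["seq_len"] = int(float(row["seq_len"]))
        yield row


# Tiny in-memory stand-in for one of the CSVs; all values are made up.
demo_csv = io.StringIO(
    "index,pdb_id,chain_id,aa_seq,threeDi_seq,combined_seq,seq_len,chunk,path,polymer_class\n"
    "0,1ABC,A,MKV,DVL,,3,1000_1999,/tmp/1abc.pdb,prot\n"
    "1,1ABC,B,,,,4,1000_1999,/tmp/1abc.pdb,DNA\n"
)

prot_rows = list(iter_chains(demo_csv, polymer_class="prot"))
print([(r["pdb_id"], r["chain_id"], r["threeDi_seq"]) for r in prot_rows])
```

Streaming row by row keeps memory use low, which may matter for the larger files.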
Basic Stats (from _all)
=== BASIC COUNTS ===
Proteins (unique pdb_id) : 111629
Chains (rows)            : 418335
Avg chains per protein   : 3.75

=== CHAIN-LEVEL COMPOSITION ===
prot  : 391317 (93.54%)
DNA   : 18575 (4.44%)
RNA   : 8375 (2.00%)
other : 68 (0.02%)

=== PROTEIN-LEVEL PRESENCE (non-exclusive) ===
Proteins with any prot : 109267 (97.88%)
Proteins with any DNA  : 7334 (6.57%)
Proteins with any RNA  : 4622 (4.14%)

=== CHAIN LENGTH (residues) ===
mean: 254.88 | Q25: 103 | Q50: 208 | Q75: 335

=== PROTEIN TOTAL LENGTH (sum of chains) ===
mean: 955.19 | Q25: 270 | Q50: 516 | Q75: 1089
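
Statistics of this kind could be recomputed from the rows along the following lines. This is a minimal sketch, not the script that produced the numbers above. The function name summarize and the demo rows are hypothetical, and the "inclusive" quartile method is an assumption, so quartiles on the real data may differ slightly from the reported values.

```python
from collections import Counter, defaultdict
from statistics import mean, quantiles


def summarize(rows):
    """Compute basic counts, class composition, and length statistics.

    Each row needs the keys pdb_id, polymer_class, and seq_len (an int).
    """
    rows = list(rows)
    classes = Counter(r["polymer_class"] for r in rows)

    # Total length per PDB entry: sum of seq_len over its chains.
    per_entry = defaultdict(int)
    for r in rows:
        per_entry[r["pdb_id"]] += r["seq_len"]

    lengths = [r["seq_len"] for r in rows]
    return {
        "entries": len(per_entry),
        "chains": len(rows),
        "avg_chains_per_entry": len(rows) / len(per_entry),
        "class_pct": {k: 100 * v / len(rows) for k, v in classes.items()},
        "chain_len_mean": mean(lengths),
        # Q25, Q50, Q75; the interpolation method is an assumption.
        "chain_len_quartiles": quantiles(lengths, n=4, method="inclusive"),
        "entry_len_mean": mean(list(per_entry.values())),
    }


# Made-up rows for illustration only.
demo_rows = [
    {"pdb_id": "1ABC", "polymer_class": "prot", "seq_len": 100},
    {"pdb_id": "1ABC", "polymer_class": "DNA", "seq_len": 20},
    {"pdb_id": "2XYZ", "polymer_class": "prot", "seq_len": 300},
    {"pdb_id": "2XYZ", "polymer_class": "prot", "seq_len": 200},
]

stats = summarize(demo_rows)
for key, value in stats.items():
    print(f"{key}: {value}")
```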