Scientists find 1,700 'dark proteins' in genome

- Researchers in the TransCODE consortium reported on May 6 that human cells make 1,785 previously unannotated microproteins from overlooked genomic regions.
- The team analyzed 95,520 proteomics experiments and 3.7 billion data points, finding about 25% of 7,264 non-canonical open reading frames produced detectable molecules.
- The Nature paper and linked public datasets give other scientists a roadmap for follow-up functional and disease studies.

An international research consortium has reported evidence that human cells produce 1,785 previously unannotated microproteins from parts of the genome long treated as noncoding or too marginal to matter. The study, published in *Nature* on May 6, comes from the TransCODE consortium, which analyzed 7,264 non-canonical open reading frames, or ncORFs, and found roughly a quarter generated detectable protein-like molecules.

The work adds a new layer to the long-running effort to define what the human genome actually makes. It also gives those molecules a new label, “peptideins,” for cases where the products look protein-like but their biological function is still unclear.

### Where did these “dark proteins” come from?

The newly reported molecules come from ncORFs, short stretches of DNA that sit outside the standard gene catalog or within regions not usually annotated as protein-coding. Researchers have been finding signs for years that some of these regions are translated, but the new paper tries to measure that systematically at protein level rather than treating them as isolated curiosities. (embl.org)

The consortium said the “dark proteome” refers to gene products from overlooked sections of DNA. In this study, the team found that many of those products were unusually small: 65% were fewer than 50 amino acids long, compared with less than 1% of the roughly 19,500 proteins in standard curated databases.

### Why are scientists calling some of them “peptideins”?

The term “peptidein” is a classification tool, not a claim that every newly detected molecule is a conventional protein with a known job. (embl.org) The researchers introduced it because many of the molecules are too small, too evolutionarily recent, or too poorly characterized to fit cleanly into older protein definitions used by reference databases. (e3.eurekalert.org)

Jonathan Mudge of EMBL-EBI, a co-first author, said the category was meant to bring these molecules “out of the shadows and into reference annotation,” according to EMBL. Hospital del Mar Research Institute, one of the participating centers, said the naming is intended to make it easier to include them in databases and study their functions. (embl.org)

### How did the team find them?

The researchers pulled together a large proteogenomics workflow rather than relying on one experiment. The study drew on 95,520 mass spectrometry-based proteomics experiments and 3.7 billion raw data points, a computational effort the consortium said took about 20,000 hours of nonstop processing. GenomeWeb reported that the team first applied strict Human Proteome Project-style evidence rules and got relatively few hits. (embl.org)

The count increased when researchers broadened the search and, especially, when they added immunopeptidomic datasets (peptides presented by human leukocyte antigen molecules), which helped bring the total to 3,116 peptides encoded by 1,785 ncORFs. (e3.eurekalert.org)

### Does this mean scientists found 1,700 brand-new genes?

The paper does not say 1,785 classic genes were added to the textbook list. It says researchers found protein-level evidence for translation from 1,785 ncORFs, many of them outside standard annotations, and proposed a framework for how to classify and catalog those products.

*Nature* said the work addresses a gap between evidence that ncORFs are translated and the lack of consensus on which of those products belong in the human proteome. (genomeweb.com) That makes this as much an annotation and evidence problem as a counting exercise.

### Why are disease researchers paying attention?

The consortium said some of the newly detected molecules appear on immune cell surfaces, making them possible candidates for cancer immunotherapy targets. (nature.com) EMBL said cancer cells express high levels of some of these molecules, and Hospital del Mar said some have already been linked to childhood cancers and basic cellular functions.

At the same time, the researchers were explicit that function remains unresolved for many of them. Robert Moritz of the Institute for Systems Biology told GenomeWeb that the existence of these molecules is now clearer than their biological role, which remains the next major question.

### What happens next?

The TransCODE consortium said it plans to add peptideins to reference resources including GENCODE, UniProt and PeptideAtlas, and to keep releasing data in open formats for other groups to test. (embl.org)

Hospital del Mar said one next line of work will compare dark proteomes across closely related species to look for evolutionary conservation, while the consortium also used CRISPR screens to begin testing whether some peptideins are essential for cell survival. (genomeweb.com)
