Date: March 25, 2025

Dataset Title: Large-scale plant transcriptome mining reveals macrocyclic diversification and improved lung cancer cell cytotoxicity of the stephanotic acid scaffold

Dataset Creators: Xiaofeng Wang, Khadija Shafiq, Derrick Ousley, Desnor N. Chigumba, Dulciana Davis, Kali McDonough, Lisa S. Mydy, Jonathan Z. Sexton, Roland D. Kersten

Dataset Contact: Roland D. Kersten rkersten@umich.edu

Funding: R35GM146934 (NIGMS), Herman Frasch Foundation, T32GM140223 (NIGMS), F32GM146395 (NIGMS), F31GM155959 (NIGMS), Rackham Merit Fellowship Program, PhRMA Foundation Predoctoral Fellowship, Rackham Predoctoral Fellowship

Key Points:
- We applied scaled de novo transcriptome assembly for the discovery of stephanotic acid-type burpitides and underlying cyclases.
- 7,579 plant species transcriptomes were assembled de novo with rnaSPAdes (v3.15.5) and searched for BURP domain transcripts encoding stephanotic acid core peptides (QLxxW) by tblastn (BLAST+ v2.16.0) on Sequenceserver (v3.1.0) and RepeatFinder.
- Candidate stephanotic-acid burpitide cyclases from 37 species were predicted and stephanotic acid-type burpitide cyclases with new second-ring-crosslinks were identified.

Research Overview:
Moroidins are plant ribosomally synthesized and posttranslationally modified peptides (RiPPs) called burpitides biosynthesized from copper-dependent peptide cyclases. The bicyclic structure of moroidins contains (1) a stephanotic acid scaffold with a Leu-Cβ-Trp-indole-C6-crosslink and (2) a C-terminal ring formed by a Trp-indole-C2-His-imidazole-N1-crosslink. Moroidin is cytotoxic to H1437 non-small cell lung adenocarcinoma cells in vitro, underscoring the potential of stephanotic acid-type burpitides as anticancer lead structures and the importance of exploring diversification strategies to discover analogs with enhanced bioactivity.
We mined the transcriptomes of 7,579 plant species from 498 plant families to identify moroidin analogs with novel second ring structures and the cyclases responsible for their biosynthesis. A search of >27,000 candidate burpitide cyclases reveals two stephanotic acid-type burpitides in plants with new second ring crosslinks derived from posttranslational modification: Glechomanin from ground ivy (Glechoma hederacea), with a C-C-crosslink between a C-terminal tryptophan-indole-C6 and the β-carbon of a valine, and Mercurialin from annual mercury (Mercurialis annua), with a C-O-crosslink between a C-terminal tyrosine-phenol hydroxy and the β-carbon of a phenylalanine. Furthermore, our transcriptomics-guided burpitide genotyping enabled isolation of a moroidin analog from water chickweed (Stellaria aquatica), which exhibits a nine-fold higher in vitro cytotoxicity than moroidin against H1437 lung adenocarcinoma cells. We demonstrate that plant transcriptome mining can expand the medicinal chemistry toolbox for chemical and biological exploration of burpitide lead structures.

Methodology:
Paired-end RNA-seq data were downloaded from the NCBI Sequence Read Archive via fasterq-dump (parameter: --split-files) of the SRA Toolkit (v2.10.9) to the Great Lakes High Performance Computing (HPC) Cluster at the University of Michigan, Ann Arbor. For large-scale assembly, the datasets were trimmed (TrimGalore v0.6.7) and assembled with rnaSPAdes (v3.15.5) on the Great Lakes HPC cluster as specified below. Assemblies that failed at 48 GB memory were assembled at 180 GB memory, as noted in Data S6.
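The download and trimming steps described above can be sketched as a short shell script. This is a minimal sketch under stated assumptions, not the project's deposited script: the accession SRR0000000 is a placeholder, the function names and the DRY_RUN switch are introduced here for illustration, and only the --split-files parameter is taken from the methodology. TrimGalore's --paired mode is assumed because it writes files ending in _1_val_1.fq and _2_val_2.fq, which matches the input names in the assembly command.

```shell
#!/usr/bin/env bash
# Sketch: download one paired-end run and trim it before assembly.
# SRR0000000 is a placeholder accession; DRY_RUN and the function names
# are illustrative and do not come from the dataset's own scripts.
set -euo pipefail

ACCESSION="${1:-SRR0000000}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Print the fasterq-dump call; --split-files writes
# <accession>_1.fastq and <accession>_2.fastq for paired-end runs.
download_cmd() {
  echo "fasterq-dump --split-files ${1}"
}

# Print the TrimGalore call; --paired writes
# <accession>_1_val_1.fq and <accession>_2_val_2.fq.
trim_cmd() {
  echo "trim_galore --paired ${1}_1.fastq ${1}_2.fastq"
}

# Echo a command line, and execute it unless DRY_RUN=1.
run() {
  echo "+ $1"
  if [ "$DRY_RUN" != "1" ]; then
    bash -c "$1"
  fi
}

run "$(download_cmd "$ACCESSION")"
run "$(trim_cmd "$ACCESSION")"
```

By default the script only prints the two command lines; setting DRY_RUN=0 executes them, which requires the SRA Toolkit and TrimGalore to be available on the PATH.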
rnaSPAdes:
#SBATCH --partition=standard
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=7
#SBATCH --mem=48g
module load spades/3.15.5-jhe6qq2
spades.py --rna -1 ./SRA#_1_val_1.fq -2 ./SRA#_2_val_2.fq -o SRAaccession_file

Files contained here:
plant_transcriptomes_1000.tar.gz - plant transcriptome assemblies 1-1000
plant_transcriptomes_2000.tar.gz - plant transcriptome assemblies 1001-2000
plant_transcriptomes_3000.tar.gz - plant transcriptome assemblies 2001-3000
plant_transcriptomes_4000.tar.gz - plant transcriptome assemblies 3001-4000
plant_transcriptomes_5000.tar.gz - plant transcriptome assemblies 4001-5000
plant_transcriptomes_6000.tar.gz - plant transcriptome assemblies 5001-6000
plant_transcriptomes_7000.tar.gz - plant transcriptome assemblies 6001-7000
plant_transcriptomes_8000.tar.gz - plant transcriptome assemblies 7001-8000
plant_transcriptomes_8000+.tar.gz - plant transcriptome assemblies 8001-8893
genome-guided-assembly-benchmarking.tar.gz - genome-guided plant transcriptome assemblies
denovo-assembly-benchmarking.tar.gz - de novo plant transcriptome assemblies
Data_S3_-_Benchmarking_assembly.xlsx - Information about datasets from de novo RNA-seq assembly benchmarking
Data_S5_-_Benchmarking_assembly_-_de_novo_vs_genome-guided.xlsx - Information about datasets from comparison of de novo RNA-seq assembly and genome-guided RNA-seq assembly
Data_S6_-_Transcriptome_table.xlsx - Information about plant transcriptomes (1-8893)

For RNA-seq benchmarking, 27 RNA-seq datasets were trimmed (TrimGalore v0.6.7) and assembled de novo via SPAdes (v3.15.5), MEGAHIT (v1.2.9), and Trinity (v2.15.1) (Data S3). 16 RNA-seq datasets were assembled genome-guided via StringTie (v2.2.1) or Trinity (v2.15.1) after alignment and mapping of trimmed reads to the annotated reference genome (nucleotide fasta and gtf/gff3) with STAR (v2.7.11a) (Data S5).
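To locate a single large-scale assembly among the plant_transcriptomes archives listed above, the numeric ranges in the file list can be turned into a small lookup. The sketch below assumes that the assembly numbers (1-8893) are the ones used in Data S6; the function name archive_for_assembly and the example number 4321 are hypothetical and introduced here for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: map an assembly number (1-8893) to the archive that holds it,
# following the ranges given in the file list. The function name is illustrative.
set -euo pipefail

archive_for_assembly() {
  local n="$1"
  # Accept only positive integers without leading zeros, within 1-8893.
  if ! [[ "$n" =~ ^[1-9][0-9]*$ ]] || [ "$n" -gt 8893 ]; then
    echo "assembly number must be between 1 and 8893" >&2
    return 1
  fi
  if [ "$n" -gt 8000 ]; then
    # Assemblies 8001-8893 are stored in the final archive.
    echo "plant_transcriptomes_8000+.tar.gz"
  else
    # Round up to the next multiple of 1000 (e.g. 1001 -> 2000).
    echo "plant_transcriptomes_$(( (n + 999) / 1000 * 1000 )).tar.gz"
  fi
}

# Example: print the archive for a hypothetical assembly number.
archive_for_assembly "${1:-4321}"
```

Once the matching archive has been downloaded, it can be unpacked with tar -xzf followed by the archive name. How individual assemblies are named inside each archive is not stated here, so Data S6 remains the place to check.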
Files contained here:
All scripts for this project are deposited on GitHub: https://github.com/UM-KerstenLab4009/Moroidin-Transcriptomics

Related publication(s):
Wang, X., Shafiq, K., Ousley, D. et al. (2025) Large-scale plant transcriptome mining reveals macrocyclic diversification and improved lung cancer cell cytotoxicity of the stephanotic acid scaffold. Forthcoming.

Use and Access:
This data set is made available under a Creative Commons Public Domain license (CC0 1.0).

To Cite Data:
Wang, X., Shafiq, K., Ousley, D., Chigumba, D.N., Davis, D., McDonough, K., Mydy, L.S., Sexton, J.Z. & Kersten, R.D. (2024) Large-scale plant transcriptome mining reveals macrocyclic diversification and improved lung cancer cell cytotoxicity of the stephanotic acid scaffold [Data set]. University of Michigan - Deep Blue.