Plant natural products (PNPs) are an important source of human nutrition, industrial raw materials, and medicinal ingredients, and half of anticancer drugs are derived from PNPs such as paclitaxel, vinblastine, and ginsenoside (Caputi et al., ; Luo et al., 2019; Yang et al., 2020). Biosynthesis is one of the key ways to produce PNPs, and the growing body of medicinal phyto-omics data helps to decode PNP biosynthetic pathways (Liu et al., 2017). Genetic resources also provide the basis for molecular breeding of medicinal plants (MPs).
To integrate the genome and transcriptome data of MPs, we completed the first omics database for herbal medicine (HMOD) in December 2017 (Wang et al., 2018). However, HMOD holds relatively little genomic data and only simple metabolite information, and as the available data grow it has become necessary to comprehensively optimize and upgrade the database in terms of data, interface, tools, and management. We therefore constructed an integrated multi-omics database for MPs (MPOD; http://medicinalplants.ynau.edu.cn/).
MPOD collects genomes and transcriptomes of MPs published since January 2018. In addition, we sequenced six genomes, 28 transcriptomes, and five metabolomes in this study. All genomic and transcriptomic sequences in MPOD can be queried for orthologous gene candidates and used for homology comparison between gene families from different species by BLAST. More importantly, we performed correlation analyses between metabolite distribution and gene expression, covering metabolite content in different tissues, expression profiles, and Pearson correlation analyses of genes involved in metabolic pathways. Compared with HMOD, MPOD details the metabolic pathways of flavonoids, alkaloids, and terpenoids separately. To facilitate synthetic biology, the ‘biosynthetic tools’ module is added to MPOD along with some popular bioinformatics tools including SynVisio, heatmap, and enrichment.
The framework of MPOD is built with MySQL, ThinkPHP, and FastAdmin and has four main modules: genomics, transcriptomics, pathways, and biosynthetic tools (Figure 1a, b). In brief, the genomics module consists of four sections: genomes, genome size, re-sequencing, and gene (Figure 1c). The genomes section contains 154 published genomes and 6 unpublished genome assemblies (Synsepalum dulcificum, Antirrhinum majus, Platycodon grandiflorus, Codonopsis pilosula, Panax vietnamensis, Gynostemma pentaphyllum) from this project. The web page for each species comprises a species introduction, sequencing data, assembly results, data source links, and references. For published genomic data, the GCA records deposited at NCBI are linked from MPOD; for unpublished data, FASTA-formatted files of the assembly, CDS, and protein sequences can be downloaded from the database. The genome size section provides genome size estimates for 50 plants, obtained by flow cytometry. The re-sequencing section contains single nucleotide polymorphism (SNP) information for Erigeron breviscapus and P. notoginseng (He et al., 2021) from our team, and published re-sequencing data for 19 other plants. The gene section provides gene assembly, annotation, and expression profiles for E. breviscapus and Acanthopanax senticosus.
The transcriptomics module contains three sections: transcriptomes, expression, and Pearson. The transcriptomes section collects 200 published datasets and 28 datasets sequenced de novo in this project (Figure 1d). Each entry consists of a species introduction, sample information, sequencing data, assembly results, annotation methods, data source links, and references. Transcriptome data are uploaded and linked in the same way as the genomes. More importantly, for the 28 unpublished transcriptomes, we provide gene expression profiles from different experimental conditions or tissues as heatmaps for easy visualization. We also perform Pearson correlation analyses of genes involved in metabolic pathways using some of our transcript expression data.
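For reference, the Pearson correlation coefficient between two expression profiles can be written in its standard textbook form. The symbols below are notation introduced here for illustration, not taken from MPOD: $x_i$ and $y_i$ denote the expression values of two genes in sample $i$, $n$ is the number of samples, and $\bar{x}$, $\bar{y}$ are the sample means.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1
```

Values near 1 indicate genes whose expression rises and falls together across samples, and values near -1 indicate opposite trends.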
The pathways module collects 85 typical compounds whose biosynthetic pathways have been deciphered, including 28 flavonoids, 28 terpenoids, 20 alkaloids, and 9 other compounds. For each compound, the module lists the compound name, molecular formula, molecular weight, function, basic organisms, precursor, host, synthesis type, downstream gene, pathway, and reference (Figure 1e). The module also collects 7 important compounds whose biosynthetic pathways have not been completely deciphered. For these, it likewise lists the compound type, distribution, and proposed pathway, and provides the sequences and expression profiles of candidate genes potentially involved in biosynthesis. It also provides five metabolomes, showing metabolite content in different tissues as heatmaps.
The biosynthetic tools module lists chassis cells, catalytic components, and regulatory elements (Figure 1f). The chassis cells section presents 46 strains of Escherichia coli and Saccharomyces cerevisiae commonly used in biosynthesis, together with Nicotiana benthamiana and Solanum lycopersicum as heterologous expression platforms for reconstituting PNP pathways. The catalytic components section summarizes 629 enzymes from 8 major gene families that play key roles in the biosynthesis of natural products: 21 acyltransferases (ACT), 7 C-glycosyltransferases (CGT), 159 cytochrome P450s (CYP), 75 O-methyltransferases (OMT), 163 oxidosqualene cyclases (OSC), 25 squalene epoxidases (SE), 65 terpene synthases (TPS), and 114 UDP-glycosyltransferases (UGT). For each enzyme, the accession number, gene length, sequence, reaction equation, and references are listed. The regulatory elements section presents 196 microbial promoter and terminator sequences commonly used in biosynthesis.
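As a quick consistency check on the catalytic components figures, the per-family counts quoted above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary and the `summarize` helper are names introduced here, not part of MPOD, and the counts are copied from the text.

```python
# Enzyme counts per gene family, copied from the text above.
CATALYTIC_COMPONENTS = {
    "ACT": 21,   # acyltransferase
    "CGT": 7,    # C-glycosyltransferase
    "CYP": 159,  # cytochrome P450
    "OMT": 75,   # O-methyltransferase
    "OSC": 163,  # oxidosqualene cyclase
    "SE": 25,    # squalene epoxidase
    "TPS": 65,   # terpene synthase
    "UGT": 114,  # UDP-glycosyltransferase
}


def summarize(counts):
    """Return (total, families sorted from largest to smallest)."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return total, ranked


total, ranked = summarize(CATALYTIC_COMPONENTS)
print(total)      # 629
print(ranked[0])  # ('OSC', 163)
```

The eight families sum to the stated total of 629 enzymes, with OSC and CYP being the two largest families.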
In addition to the main modules, MPOD provides some popular bioinformatics tools including ‘BLAST’, ‘Search’, ‘Heatmap’, and ‘JBrowse’ (Dong et al., 2020). All available MPOD genomes and gene models are incorporated into JBrowse. ‘SynVisio’ shows gene synteny relationships among chromosome-level reference genomes. ‘Co-expression analysis’ creates networks comprising sets of genes whose expression levels are highly correlated.
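The idea behind a co-expression network can be sketched as follows: compute a correlation for every pair of genes and keep an edge when the correlation is strong. The code below is a minimal sketch under stated assumptions, not MPOD's implementation. The use of Pearson correlation, the absolute-value cutoff of 0.9, the function names, and the gene names and expression values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: r is undefined, treated as uncorrelated
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical expression values across four tissues (not MPOD data).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 5.0, 1.0, 5.0],
}
edges = coexpression_edges(expression, threshold=0.9)
for edge in edges:
    print(edge)
```

In this toy input, geneA, geneB, and geneC end up connected to one another (geneC through strong negative correlation), while geneD stays isolated. Whether negative correlations should count as edges, and which cutoff to use, are design choices that depend on the analysis.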
A typical use case of the website is shown in Figure 1g. Gypenoside A is the main active component of G. pentaphyllum, and according to the metabolome data its content is highest in leaves. The biosynthesis of gypenoside A begins with 2,3-oxidosqualene, but the key downstream enzymes OSC, CYP, and UGT have not been identified. A total of 235 CYPs from G. pentaphyllum (GpCYPs) were found by BLAST. A phylogenetic tree was constructed from the deduced amino acid sequences of the GpCYPs and other plant CYPs, and the GpCYPs were distributed across eight subfamilies, including 144 CYP71, 34 CYP85, 28 CYP72, 20 CYP86, and 4 CYP74. We also explored the expression of GpCYPs in different tissues and presented it as a heatmap. Furthermore, we performed Pearson correlation analyses of our transcript expression data among GpOSCs, GpCYPs, and GpUGTs using GpOSCs as the query genes (Figure 1g). These results facilitate the discovery of unknown genes involved in gypenoside A biosynthesis.
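The query-gene step of this workflow can be illustrated with a short sketch: given the expression profile of a query gene, candidate genes are ranked by their Pearson correlation with it. Everything specific in the code is an assumption made for illustration: the function names, the gene identifiers (`GpCYP_a`, `GpCYP_b`, `GpUGT_a`), the tissue order, and the expression values are hypothetical and are not taken from MPOD.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: r is undefined, treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def rank_by_query(query_profile, candidates, top=None):
    """Rank candidate genes by correlation with the query, highest first."""
    scored = [(name, pearson(query_profile, profile))
              for name, profile in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]  # top=None keeps the full ranking


# Hypothetical profiles; assumed tissue order: leaf, stem, root, flower.
query_osc = [5.0, 1.0, 2.0, 0.5]
candidates = {
    "GpCYP_a": [10.0, 2.0, 4.0, 1.0],  # mirrors the query pattern
    "GpCYP_b": [1.0, 4.0, 3.0, 5.0],   # opposite pattern
    "GpUGT_a": [4.0, 1.5, 2.0, 1.0],   # similar, leaf-biased pattern
}
ranked = rank_by_query(query_osc, candidates)
for name, r in ranked:
    print(f"{name}\t{r:.3f}")
```

Candidates at the top of such a ranking share the tissue pattern of the query gene and would be natural starting points for functional validation. Correlation alone does not establish that a gene acts in the pathway.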
In summary, from the gene level to the metabolite level, MPOD integrates the genomics, transcriptomics, and metabolomics data of MPs published in recent years and sequenced in this study. These datasets provide a rich genetic resource for mining functional genes, screening molecular markers, and developing biological elements. Further combining pathways with catalytic components greatly facilitates decoding the biosynthetic pathways of medicinal ingredients. MPOD will be continuously updated as multi-omics data increase and new bioinformatics tools emerge, so that it provides long-term support for research on molecular-assisted breeding and synthetic biology of MPs.