Date: August 17, 2022

Dataset Title: "Bioinformatic annotation of bacterial genes encoding SMR proteins"

Dataset creators: Randy B. Stockbridge and Christian B. Macdonald

Dataset contact: Randy B. Stockbridge, stockbr@umich.edu

Funding: National Science Foundation (grant number 1845012) to R.B.S.

Dataset description: This dataset consists of text files (.csv) containing the bioinformatic annotation of SMR genes found in a set of phylogenetically diverse bacterial genomes. The bioinformatic analysis comprises genome mining to identify SMR genes, prediction of the functional transporter subtype, and prediction of the direction of insertion in the bacterial membrane.

Research overview: This bioinformatic dataset was prepared for a review on the structures, functions, and occurrence of Small Multidrug Resistance (SMR) transporters. It contains bioinformatic annotation of SMR genes identified in bacterial genomes from the Joint Genome Institute's curated set of ~1000 Genomic Encyclopedia of Bacteria and Archaea (GEBA) genomes. The file GEBA_SMR_annotation.csv provides NCBI identification information (genome, species and chromosome information, locus tag, translation) and bioinformatic predictions of the SMR subtype and membrane insertion direction for each gene identified in the GEBA genome set. The file GEBA_SMR_species_table.csv has a separate entry for each species in the GEBA genome set, along with the bioinformatic prediction of SMR subtype and membrane insertion direction for each SMR gene identified in the genome of that species.

The dataset was generated by Christian B. Macdonald and Randy B. Stockbridge (Department of Molecular, Cellular and Developmental Biology, University of Michigan, Ann Arbor, MI, 48109). Generation of this dataset was supported by National Institutes of Health grant R35-GM128768 to Randy B. Stockbridge.

Use and access: This dataset is provided as two .csv (comma-separated values) files, which can be read with any text editor or with spreadsheet software such as Microsoft Excel.

Methods: SMR genes were identified in the GEBA genomes with HMMER 3.3.2, using a profile Hidden Markov Model (profile HMM) constructed for the SMR family (Pfam PF00893). Profile HMMs for each subtype (Gdx, Qac, polyamine transport, and lipid transport) were constructed from functionally annotated clusters in a sequence similarity network of reference SMR proteins. Each SMR sequence was assigned to the subtype whose profile HMM gave the lowest e-value calculated by HMMER. SMR sequences were annotated "other" if the e-value was >10^-20.
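
The subtype assignment rule described in Methods can be sketched in a few lines of Python. This is a minimal illustration only, not the script used to generate the dataset. It assumes that the "other" cutoff is 10^-20 and that it applies to the best (lowest) subtype e-value; the function name assign_subtype, the dictionary layout, the handling of an empty input, and the example e-values are all hypothetical.

```python
from typing import Dict

# Assumed cutoff: the README's ">10-20" is read here as 10^-20.
OTHER_THRESHOLD = 1e-20


def assign_subtype(evalues: Dict[str, float],
                   threshold: float = OTHER_THRESHOLD) -> str:
    """Return the subtype with the lowest e-value, or "other".

    `evalues` maps a subtype name to the e-value reported for that
    subtype's profile HMM against one SMR sequence (hypothetical layout).
    """
    if not evalues:
        # Assumption: a sequence with no subtype hits is labelled "other".
        return "other"
    # The best-matching subtype is the one with the smallest e-value.
    best = min(evalues, key=evalues.get)
    # A best e-value above the cutoff is treated as too weak to assign.
    if evalues[best] > threshold:
        return "other"
    return best


# Made-up e-values for a single sequence, for illustration only.
example = {
    "Gdx": 3e-45,
    "Qac": 2e-30,
    "polyamine transport": 1e-12,
    "lipid transport": 5e-8,
}
print(assign_subtype(example))
```

With the made-up values above, the Gdx profile gives the lowest e-value and it falls below the cutoff, so the sequence would be labelled Gdx. A sequence whose best e-value exceeded the cutoff would be labelled "other" regardless of which subtype ranked first.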