This readme file was generated on 2025-02-13 by Laura M Haynes

GENERAL INFORMATION

Title of Dataset: High-throughput amino acid-level characterization of the interactions of plasminogen activator inhibitor-1 with variably divergent proteases

Dataset Creators:

Principal Investigator Information
Name: David Ginsburg
Institution: University of Michigan
Address: Life Sciences Institute
Email: ginsburg@umich.edu
ORCID: 0000-0002-6436-8942

Primary Author Information
Name: Laura M. Haynes
Institution: University of Michigan
Address: Life Sciences Institute
Email: hayneslm@umich.edu
ORCID: 0000-0002-2237-659X

Author Information
Name: Matthew L. Holding
Institution: University of Michigan
Address: Life Sciences Institute
Email: venomatt@umich.edu
ORCID: 0000-0003-3477-3012

Name: Hannah L. DiGiovanni
Institution: University of Michigan
Address: Life Sciences Institute
Email: hannahdg@umich.edu

Name: David Siemieniak
Institution: University of Michigan
Address: Life Sciences Institute
Email: siemieni@umich.edu

Date of data collection: 2021

Information about funding sources that supported the collection of the data: This research was funded by the National Institutes of Health and the University of Michigan Frankel Cardiovascular Center.

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: http://creativecommons.org/licenses/by-nc/4.0/

Links to publications that cite or use the data: Haynes LM, Holding ML, DiGiovanni H, Siemieniak D, Ginsburg D. High-throughput amino acid-level characterization of the interactions of plasminogen activator inhibitor-1 with variably divergent proteases. bioRxiv [Preprint]. 2024 Sep 20:2024.09.16.612699. doi: 10.1101/2024.09.16.612699. PMID: 39345533; PMCID: PMC11429915.

DATA & FILE OVERVIEW

File List:

3392-LH-1_CACGATAT-AGATCTCG_S58_R1_001.fastq.gz
3392-LH-1_CACGATAT-AGATCTCG_S58_R2_001.fastq.gz
3392-LH-1_CACTCAAT-AGATCTCG_S59_R1_001.fastq.gz
3392-LH-1_CACTCAAT-AGATCTCG_S59_R2_001.fastq.gz
3392-LH-1_CAGGCGAT-AGATCTCG_S60_R1_001.fastq.gz
3392-LH-1_CAGGCGAT-AGATCTCG_S60_R2_001.fastq.gz
3392-LH-1_CATGGCAT-AGATCTCG_S61_R1_001.fastq.gz
3392-LH-1_CATGGCAT-AGATCTCG_S61_R2_001.fastq.gz
3392-LH-1_CATTTTAT-AGATCTCG_S62_R1_001.fastq.gz
3392-LH-1_CATTTTAT-AGATCTCG_S62_R2_001.fastq.gz
3392-LH-1_CCAACAAT-AGATCTCG_S63_R1_001.fastq.gz
3392-LH-1_CCAACAAT-AGATCTCG_S63_R2_001.fastq.gz
3392-LH-1_CGGAATAT-AGATCTCG_S64_R1_001.fastq.gz
3392-LH-1_CGGAATAT-AGATCTCG_S64_R2_001.fastq.gz
3392-LH-1_CTAGCTAT-AGATCTCG_S65_R1_001.fastq.gz
3392-LH-1_CTAGCTAT-AGATCTCG_S65_R2_001.fastq.gz
3392-LH-1_CTATACAT-AGATCTCG_S66_R1_001.fastq.gz
3392-LH-1_CTATACAT-AGATCTCG_S66_R2_001.fastq.gz
3936-LH-1_ACTGATAT-AGATCTCG_S80_R1_001.fastq.gz
3936-LH-1_ACTGATAT-AGATCTCG_S80_R2_001.fastq.gz
3936-LH-1_ATGAGCAT-AGATCTCG_S81_R1_001.fastq.gz
3936-LH-1_ATGAGCAT-AGATCTCG_S81_R2_001.fastq.gz
3936-LH-1_ATTCCTAT-AGATCTCG_S82_R1_001.fastq.gz
3936-LH-1_ATTCCTAT-AGATCTCG_S82_R2_001.fastq.gz
3936-LH-1_CAAAAGAT-AGATCTCG_S83_R1_001.fastq.gz
3936-LH-1_CAAAAGAT-AGATCTCG_S83_R2_001.fastq.gz
3936-LH-1_CAACTAAT-AGATCTCG_S84_R1_001.fastq.gz
3936-LH-1_CAACTAAT-AGATCTCG_S84_R2_001.fastq.gz
3936-LH-1_CACCGGAT-AGATCTCG_S85_R1_001.fastq.gz
3936-LH-1_CACCGGAT-AGATCTCG_S85_R2_001.fastq.gz
3936-LH-1_CACGATAT-AGATCTCG_S86_R1_001.fastq.gz
3936-LH-1_CACGATAT-AGATCTCG_S86_R2_001.fastq.gz
3936-LH-1_CACTCAAT-AGATCTCG_S87_R1_001.fastq.gz
3936-LH-1_CACTCAAT-AGATCTCG_S87_R2_001.fastq.gz
3936-LH-1_CAGGCGAT-AGATCTCG_S88_R1_001.fastq.gz
3936-LH-1_CAGGCGAT-AGATCTCG_S88_R2_001.fastq.gz
3936-LH-1_CATGGCAT-AGATCTCG_S89_R1_001.fastq.gz
3936-LH-1_CATGGCAT-AGATCTCG_S89_R2_001.fastq.gz
3936-LH-1_CATTTTAT-AGATCTCG_S90_R1_001.fastq.gz
3936-LH-1_CATTTTAT-AGATCTCG_S90_R2_001.fastq.gz
3936-LH-1_CCAACAAT-AGATCTCG_S91_R1_001.fastq.gz
3936-LH-1_CCAACAAT-AGATCTCG_S91_R2_001.fastq.gz
4641-LH-1_ACTGATAT-AGATCTCG_S321_R1_001.fastq.gz
4641-LH-1_ACTGATAT-AGATCTCG_S321_R2_001.fastq.gz
4641-LH-1_AGTCAAAT-AGATCTCG_S309_R1_001.fastq.gz
4641-LH-1_AGTCAAAT-AGATCTCG_S309_R2_001.fastq.gz
4641-LH-1_CGTACGAT-AGATCTCG_S318_R1_001.fastq.gz
4641-LH-1_CGTACGAT-AGATCTCG_S318_R2_001.fastq.gz
4641-LH-1_GAGTGGAT-AGATCTCG_S319_R1_001.fastq.gz
4641-LH-1_GAGTGGAT-AGATCTCG_S319_R2_001.fastq.gz
4641-LH-1_GATCAGAT-AGATCTCG_S306_R1_001.fastq.gz
4641-LH-1_GATCAGAT-AGATCTCG_S306_R2_001.fastq.gz
4641-LH-1_GGCTACAT-AGATCTCG_S308_R1_001.fastq.gz
4641-LH-1_GGCTACAT-AGATCTCG_S308_R2_001.fastq.gz
4641-LH-1_GGTAGCAT-AGATCTCG_S320_R1_001.fastq.gz
4641-LH-1_GGTAGCAT-AGATCTCG_S320_R2_001.fastq.gz
4641-LH-1_TAGCTTAT-AGATCTCG_S307_R1_001.fastq.gz
4641-LH-1_TAGCTTAT-AGATCTCG_S307_R2_001.fastq.gz

*A key to the data sets can be found in the accompanying file: "Key_to_FASTQ_files_v2.xlsx"

Script list with descriptions:

screen_amplicon.pl: Compares consensus and reference sequences to call amino acid substitutions
process_proc.pl: Subroutines for translating DNA sequences, assessing quality scores, and comparing paired-end reads for mismatches
complete_blast_cluster.pl: Performs BLAST alignment and categorizes sequences by alignment quality
combine_r1_r2.pl: Aligns R1 and R2 sequencing reads
clean_convert_to_fasta.pl: Processes FASTQ files
start_blast_cluster.sh: Generates and submits SLURM batch job scripts for BLAST nucleotide searches
screen_amplicon_cluster.sh: Bash script that executes screen_amplicon.pl
concat_blast_result_files.sh: Concatenates BLAST results files for R1 and R2 sequencing reads
complete_blast_cluster.sh: Bash script that executes complete_blast_cluster.pl
combine_r1_r2_cluster.sh: Bash script that executes combine_r1_r2.pl
clean_convert_cluster.sh: Bash script that executes clean_convert_to_fasta.pl
_0_Pkgs%Libraries.R: Packages and libraries necessary to execute the R scripts
_1_DESeq2.R: Executes DESeq2 analysis of counts per amino acid substitution determined from the associated FASTQ files
_2_Compare_DESeq2_results.R: Determines significance thresholds for the data sets and compares data sets
_3_Compare_to_ConSurf: Compares DESeq2 results to ConSurf evolutionary conservation scores at each amino acid position in PAI-1 (data_scores.txt)
_4_DMSheatmaps.R: Generates heatmaps of the deep mutational scanning (DMS) data

*A permanent link to the scripts can be found at: https://github.com/hayneslm/PAI-1_and_divergent_proteases

Other files needed to execute scripts:

data_scores.txt: ConSurf evolutionary conservation scores
WTbg_0h_screen: Original screen of the WT PAI-1 library to determine functional variants (Huttinger, Z.M., Haynes, L.M., Yee, A. et al. Deep mutational scanning of the plasminogen activator inhibitor-1 functional landscape. Sci Rep 11, 18827 (2021). https://doi.org/10.1038/s41598-021-97871-7)

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data: This data set was collected using a phage display PAI-1 library that was screened for its ability to inhibit different serine proteases (uPA, TMPRSS2, factor XIIa). The variants were identified using Illumina high-throughput DNA sequencing. The raw FASTQ files are contained in this data set.

Methods for processing the data: The software needed to analyze these files is included in this dataset and is also available at https://github.com/hayneslm/PAI-1_and_divergent_proteases.

Instrument- or software-specific information needed to interpret the data: Code is written in Bash, Perl, and R.

People involved with sample collection, processing, analysis and/or submission: Laura M Haynes, Matthew L Holding, David Siemieniak
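
Checking the FASTQ file list (illustrative sketch): Before running the processing scripts, it can help to confirm that every R1 file in the file list has a matching R2 file. The Python sketch below is not part of the deposited scripts. It assumes the file names follow the pattern <sample>_<index1>-<index2>_S<number>_R<1|2>_001.fastq.gz, where reading the two 8-base strings as index sequences is an interpretation of the names, not something this readme states. The names FASTQ_NAME, pair_fastq_files, and unpaired_keys are hypothetical. Consult "Key_to_FASTQ_files_v2.xlsx" for the actual meaning of each sample.

```python
import re
from collections import defaultdict

# Assumed naming pattern (an interpretation of the file list, not documented
# by the dataset): <sample>_<index1>-<index2>_S<number>_R<1|2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)_(?P<index1>[ACGT]+)-(?P<index2>[ACGT]+)"
    r"_S(?P<sample_number>\d+)_R(?P<read>[12])_001\.fastq\.gz$"
)


def pair_fastq_files(names):
    """Group file names by sample so that R1 and R2 sit side by side.

    Returns a dict mapping (sample, index1, index2, sample_number) to a
    dict such as {"R1": "<file name>", "R2": "<file name>"}.
    """
    pairs = defaultdict(dict)
    for name in names:
        match = FASTQ_NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected FASTQ file name: {name}")
        key = (
            match["sample"],
            match["index1"],
            match["index2"],
            int(match["sample_number"]),
        )
        pairs[key]["R" + match["read"]] = name
    return dict(pairs)


def unpaired_keys(pairs):
    """Return the sorted keys that lack either an R1 or an R2 file."""
    return sorted(key for key, reads in pairs.items() if set(reads) != {"R1", "R2"})


# Three names taken from the file list; the last one is deliberately given
# without its R2 mate to show how a missing file is reported.
example_names = [
    "3392-LH-1_CACGATAT-AGATCTCG_S58_R1_001.fastq.gz",
    "3392-LH-1_CACGATAT-AGATCTCG_S58_R2_001.fastq.gz",
    "4641-LH-1_TAGCTTAT-AGATCTCG_S307_R1_001.fastq.gz",
]
example_pairs = pair_fastq_files(example_names)
example_missing = unpaired_keys(example_pairs)
print(example_missing)
```

Run against the complete file list, unpaired_keys would be expected to return an empty list, since every sample in the list appears with both an R1 and an R2 file.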