This readme file was generated on 2025-02-13 by Laura M Haynes

GENERAL INFORMATION

Title of Dataset: High-throughput amino acid-level characterization of the interactions of plasminogen activator inhibitor-1 with variably divergent proteases

Dataset Creators: 

Principal Investigator Information
Name: David Ginsburg
Institution: University of Michigan
Address: Life Sciences Institute
Email: ginsburg@umich.edu
ORCID: 0000-0002-6436-8942

Primary Author Information
Name: Laura M. Haynes
Institution: University of Michigan
Address: Life Sciences Institute
Email: hayneslm@umich.edu
ORCID: 0000-0002-2237-659X

Author Information
Name: Matthew L. Holding
Institution: University of Michigan
Address: Life Sciences Institute
Email: venomatt@umich.edu
ORCID: 0000-0003-3477-3012

Name: Hannah L. DiGiovanni
Institution: University of Michigan
Address: Life Sciences Institute
Email: hannahdg@umich.edu

Name: David Siemieniak
Institution: University of Michigan
Address: Life Sciences Institute
Email: siemieni@umich.edu

Date of data collection: 2021

Information about funding sources that supported the collection of the data: This research was funded by the National Institutes of Health and the University of Michigan Frankel Cardiovascular Center.

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: http://creativecommons.org/licenses/by-nc/4.0/

Links to publications that cite or use the data: Haynes LM, Holding ML, DiGiovanni H, Siemieniak D, Ginsburg D. High-throughput amino acid-level characterization of the interactions of plasminogen activator inhibitor-1 with variably divergent proteases. bioRxiv [Preprint]. 2024 Sep 20:2024.09.16.612699. doi: 10.1101/2024.09.16.612699. PMID: 39345533; PMCID: PMC11429915.


DATA & FILE OVERVIEW

File List:
3392-LH-1_CACGATAT-AGATCTCG_S58_R1_001.fastq.gz
3392-LH-1_CACGATAT-AGATCTCG_S58_R2_001.fastq.gz
3392-LH-1_CACTCAAT-AGATCTCG_S59_R1_001.fastq.gz
3392-LH-1_CACTCAAT-AGATCTCG_S59_R2_001.fastq.gz
3392-LH-1_CAGGCGAT-AGATCTCG_S60_R1_001.fastq.gz
3392-LH-1_CAGGCGAT-AGATCTCG_S60_R2_001.fastq.gz
3392-LH-1_CATGGCAT-AGATCTCG_S61_R1_001.fastq.gz
3392-LH-1_CATGGCAT-AGATCTCG_S61_R2_001.fastq.gz
3392-LH-1_CATTTTAT-AGATCTCG_S62_R1_001.fastq.gz
3392-LH-1_CATTTTAT-AGATCTCG_S62_R2_001.fastq.gz
3392-LH-1_CCAACAAT-AGATCTCG_S63_R1_001.fastq.gz
3392-LH-1_CCAACAAT-AGATCTCG_S63_R2_001.fastq.gz
3392-LH-1_CGGAATAT-AGATCTCG_S64_R1_001.fastq.gz
3392-LH-1_CGGAATAT-AGATCTCG_S64_R2_001.fastq.gz
3392-LH-1_CTAGCTAT-AGATCTCG_S65_R1_001.fastq.gz
3392-LH-1_CTAGCTAT-AGATCTCG_S65_R2_001.fastq.gz
3392-LH-1_CTATACAT-AGATCTCG_S66_R1_001.fastq.gz
3392-LH-1_CTATACAT-AGATCTCG_S66_R2_001.fastq.gz
3936-LH-1_ACTGATAT-AGATCTCG_S80_R1_001.fastq.gz
3936-LH-1_ACTGATAT-AGATCTCG_S80_R2_001.fastq.gz
3936-LH-1_ATGAGCAT-AGATCTCG_S81_R1_001.fastq.gz
3936-LH-1_ATGAGCAT-AGATCTCG_S81_R2_001.fastq.gz
3936-LH-1_ATTCCTAT-AGATCTCG_S82_R1_001.fastq.gz
3936-LH-1_ATTCCTAT-AGATCTCG_S82_R2_001.fastq.gz
3936-LH-1_CAAAAGAT-AGATCTCG_S83_R1_001.fastq.gz
3936-LH-1_CAAAAGAT-AGATCTCG_S83_R2_001.fastq.gz
3936-LH-1_CAACTAAT-AGATCTCG_S84_R1_001.fastq.gz
3936-LH-1_CAACTAAT-AGATCTCG_S84_R2_001.fastq.gz
3936-LH-1_CACCGGAT-AGATCTCG_S85_R1_001.fastq.gz
3936-LH-1_CACCGGAT-AGATCTCG_S85_R2_001.fastq.gz
3936-LH-1_CACGATAT-AGATCTCG_S86_R1_001.fastq.gz
3936-LH-1_CACGATAT-AGATCTCG_S86_R2_001.fastq.gz
3936-LH-1_CACTCAAT-AGATCTCG_S87_R1_001.fastq.gz
3936-LH-1_CACTCAAT-AGATCTCG_S87_R2_001.fastq.gz
3936-LH-1_CAGGCGAT-AGATCTCG_S88_R1_001.fastq.gz
3936-LH-1_CAGGCGAT-AGATCTCG_S88_R2_001.fastq.gz
3936-LH-1_CATGGCAT-AGATCTCG_S89_R1_001.fastq.gz
3936-LH-1_CATGGCAT-AGATCTCG_S89_R2_001.fastq.gz
3936-LH-1_CATTTTAT-AGATCTCG_S90_R1_001.fastq.gz
3936-LH-1_CATTTTAT-AGATCTCG_S90_R2_001.fastq.gz
3936-LH-1_CCAACAAT-AGATCTCG_S91_R1_001.fastq.gz
3936-LH-1_CCAACAAT-AGATCTCG_S91_R2_001.fastq.gz
4641-LH-1_ACTGATAT-AGATCTCG_S321_R1_001.fastq.gz
4641-LH-1_ACTGATAT-AGATCTCG_S321_R2_001.fastq.gz
4641-LH-1_AGTCAAAT-AGATCTCG_S309_R1_001.fastq.gz
4641-LH-1_AGTCAAAT-AGATCTCG_S309_R2_001.fastq.gz
4641-LH-1_CGTACGAT-AGATCTCG_S318_R1_001.fastq.gz
4641-LH-1_CGTACGAT-AGATCTCG_S318_R2_001.fastq.gz
4641-LH-1_GAGTGGAT-AGATCTCG_S319_R1_001.fastq.gz
4641-LH-1_GAGTGGAT-AGATCTCG_S319_R2_001.fastq.gz
4641-LH-1_GATCAGAT-AGATCTCG_S306_R1_001.fastq.gz
4641-LH-1_GATCAGAT-AGATCTCG_S306_R2_001.fastq.gz
4641-LH-1_GGCTACAT-AGATCTCG_S308_R1_001.fastq.gz
4641-LH-1_GGCTACAT-AGATCTCG_S308_R2_001.fastq.gz
4641-LH-1_GGTAGCAT-AGATCTCG_S320_R1_001.fastq.gz
4641-LH-1_GGTAGCAT-AGATCTCG_S320_R2_001.fastq.gz
4641-LH-1_TAGCTTAT-AGATCTCG_S307_R1_001.fastq.gz
4641-LH-1_TAGCTTAT-AGATCTCG_S307_R2_001.fastq.gz
*A key to the data sets can be found in the accompanying file: "Key_to_FASTQ_files_v2.xlsx"
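
The file names above appear to follow the pattern <run>-LH-1_<index1>-<index2>_S<sample>_R<read>_001.fastq.gz, with each sample present as an R1/R2 pair. This pattern is inferred from the file list only and is not a documented specification. As a minimal sketch under that assumption, the hypothetical helper below (pair_fastq_files, not part of the deposited scripts) groups file names into R1/R2 pairs before they are matched against the key file:

```python
import re
from collections import defaultdict

# Pattern inferred from the file list (an assumption, not a documented spec):
# <run>-LH-1_<index1>-<index2>_S<sample>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<run>\d+)-LH-1_(?P<index1>[ACGT]+)-(?P<index2>[ACGT]+)"
    r"_S(?P<sample>\d+)_R(?P<read>[12])_001\.fastq\.gz$"
)


def pair_fastq_files(names):
    """Group file names into {(run, sample): {"R1": name, "R2": name}}."""
    pairs = defaultdict(dict)
    for name in names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # skip names that do not follow the inferred pattern
        key = (match["run"], int(match["sample"]))
        pairs[key]["R" + match["read"]] = name
    return dict(pairs)


# A few names copied from the file list, used here only as example input.
example = [
    "3392-LH-1_CACGATAT-AGATCTCG_S58_R1_001.fastq.gz",
    "3392-LH-1_CACGATAT-AGATCTCG_S58_R2_001.fastq.gz",
    "3936-LH-1_CACGATAT-AGATCTCG_S86_R1_001.fastq.gz",
    "3936-LH-1_CACGATAT-AGATCTCG_S86_R2_001.fastq.gz",
]
pairs = pair_fastq_files(example)
for (run, sample), reads in sorted(pairs.items()):
    print(run, sample, reads.get("R1"), reads.get("R2"))
```

Keying on both the run prefix and the sample number keeps apart samples that reuse the same index sequence in different sequencing runs (for example, CACGATAT occurs under both 3392 and 3936).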

Script list with descriptions:
screen_amplicon.pl: compares consensus and reference sequences to call amino acid substitutions
process_proc.pl: subroutines for translating DNA sequences, assessing quality scores, and comparing paired-end reads for mismatches
complete_blast_cluster.pl: handles BLAST alignment and categorizes sequences by alignment quality
combine_r1_r2.pl: aligns R1 and R2 sequencing reads
clean_convert_to_fasta.pl: processes FASTQ files
start_blast_cluster.sh: generates and submits SLURM batch job scripts for BLAST nucleotide searches
screen_amplicon_cluster.sh: Bash script that runs screen_amplicon.pl
concat_blast_result_files.sh: Concatenates BLAST result files for R1 and R2 sequencing reads
complete_blast_cluster.sh: Bash script that executes complete_blast_cluster.pl
combine_r1_r2_cluster.sh: Bash script that executes combine_r1_r2.pl
clean_convert_cluster.sh: Bash script that executes clean_convert_to_fasta.pl
_0_Pkgs%Libraries.R: packages and libraries necessary to execute R scripts
_1_DESeq2.R: Executes DESeq2 analysis of counts per amino acid substitution determined from associated FASTQ files
_2_Compare_DESeq2_results.R: Determines significance thresholds for the data sets and compares datasets
_3_Compare_to_ConSurf: Compares DESeq2 results to ConSurf evolutionary conservation scores at each amino acid position in PAI-1 (data_scores.txt)
_4_DMSheatmaps.R: Generates heatmaps of the DMS data
*A permanent link to scripts can be found at: https://github.com/hayneslm/PAI-1_and_divergent_proteases

Other files needed to execute scripts:
data_scores.txt: ConSurf evolutionary conservation scores
WTbg_0h_screen: Original screen of the WT PAI-1 library to determine functional variants (Huttinger, Z.M., Haynes, L.M., Yee, A. et al. Deep mutational scanning of the plasminogen activator inhibitor-1 functional landscape. Sci Rep 11, 18827 (2021). https://doi.org/10.1038/s41598-021-97871-7)

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data:
This data set was collected using a phage display PAI-1 library that was screened for its ability to inhibit different serine proteases (uPA, TMPRSS2, factor XIIa). The variants were identified using Illumina high-throughput DNA sequencing. The raw FASTQ files are contained in this data set.

Methods for processing the data: The software needed to analyze these files is included in this dataset and is also available at https://github.com/hayneslm/PAI-1_and_divergent_proteases.

Instrument- or software-specific information needed to interpret the data: Code is written in Bash, Perl, and R.

People involved with sample collection, processing, analysis and/or submission: Laura M Haynes, Matthew L Holding, David Siemieniak