Readme file to accompany the DeepBlueData deposited dataset "Human challenge study 2015." The clinical shedding/symptom data, RNAseq data, and wearable E4 data were partially presented in publications [1]-[3], and the cognitive Lumos and VAFS data are presented in paper [4], which is under review and embargoed. The data files are: subject.json, sample.json, and genematrix_TPM.csv.


---------------subject.json and sample.json----------------

The json files were created by applying jsonencode() to MATLAB data structures. After reading these json files into MATLAB and applying jsondecode() to their contents, you will obtain two data structures:

1. A subject data structure for 18 subjects containing demographic data, the 21 visit times for biological and cognitive sample collection, shedding and symptom status, daily shedding data over the 8 days of the study, and the binary accumulated shedding label (low or high).

>> subject =
  struct with fields:

              age: [18×1 double] - age of each subject
           gender: {18×1 cell} - gender of each subject
       visit_time: [18×21 double] - time that subject visited the clinic for blood draws, cognitive tests, and E4 data downloads (excludes time at 480 hours)
         shedding: [18×8 double] - viral shedding was measured over the 4 days after inoculation (matrix padded with zeros for pre-inoculation days)
    shedding_time: [8×1 double] - the times at which shedding was measured
        ShedLabel: {18×1 cell} - the label denoting (subject ID, gender, infection status), where infection status is low (0) or high (1) if accumulated shedding is below or above the median, respectively

2. A sample data structure containing the collected steroid data, Empatica E4 data, interpolated titer data (interpolated to 21 time points, excluding the follow-up time at 480 hours, after the study ended), full symptom data, Lumos data over 21 time points, and VAFS data over 21 time points.

>> sample = 

  struct with fields:

    steroid: [1×1 struct] - the raw steroid data 
         E4: [1×1 struct] - the Empatica E4 wearable extracted features
      titer: [1×1 struct] - the interpolated shedding titer data
    symptom: [1×1 struct] - the raw symptom data
      lumos: [1×1 struct] - the cognitive lumos testing data
       vafs: [1×1 struct] - the cognitive VAFS data
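
As a minimal sketch of the loading step described above, the two structures can be obtained with fileread() and jsondecode(). It assumes that subject.json and sample.json are in the current MATLAB folder. The median split at the end is illustrative only: it assumes that accumulated shedding is the row sum of subject.shedding, and it may not reproduce the stored ShedLabel values exactly.

```matlab
% Read each json file as text and decode it into a MATLAB struct.
% Assumption: subject.json and sample.json are in the current folder.
subject = jsondecode(fileread('subject.json'));
sample  = jsondecode(fileread('sample.json'));

% List the top-level fields of each structure.
disp(fieldnames(subject));
disp(fieldnames(sample));

% Accumulated shedding per subject (sum over the 8 shedding time points).
acc_shedding = sum(subject.shedding, 2, 'omitnan');

% Illustrative low/high split at the median of accumulated shedding.
is_high = acc_shedding > median(acc_shedding);
fprintf('%d of %d subjects are above the median.\n', nnz(is_high), numel(is_high));
```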


>> sample.steroid = 

  struct with fields:

         raw_data: [3×396 double] - 3 steroids were assayed for all 18×22 subject/time-point combinations (includes the follow-up time at 480 hours post-inoculation)
          subject: [396×1 double] - subject label for each of the 396 samples
             time: [396×1 double] - sampling time label for each of the 396 samples
            imtrx: [18×22 double] - a matrix indexing sample number to the subject (row) and the time point (col)
          allsubj: [18×1 double] - the subject IDs
          alltime: [22×1 double] - the time points (0 denotes the inoculation time)
    steroid_names: {3×1 cell}    - names of each of the 3 assayed steroids
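
The role of imtrx can be sketched as follows. This is a hedged example, not a documented access pattern: it assumes that imtrx(row, col) holds the 1-based column of raw_data for subject allsubj(row) at time alltime(col), and it checks that assumption against the per-sample subject and time labels before using it. The variables row and k are arbitrary illustrative choices, and the axis unit (hours) is assumed from the 480-hour follow-up time mentioned above.

```matlab
sample = jsondecode(fileread('sample.json'));
st = sample.steroid;

row = 1;   % hypothetical choice: first subject listed in st.allsubj
k   = 1;   % hypothetical choice: first steroid listed in st.steroid_names

% Assumption: st.imtrx(row, col) is the 1-based column of raw_data holding
% the sample of subject st.allsubj(row) at time st.alltime(col).
idx = st.imtrx(row, :);

% Check the assumption against the per-sample labels.
subj_labels = st.subject(idx);
time_labels = st.time(idx);
assert(all(subj_labels(:) == st.allsubj(row)), 'imtrx does not match subject labels');
assert(isequal(time_labels(:), st.alltime(:)), 'imtrx does not match time labels');

% Time course of steroid k for the chosen subject (1x22 vector).
series = st.raw_data(k, idx);

plot(st.alltime, series, '-o');
xlabel('Time (assumed hours; 0 = inoculation)');
ylabel(st.steroid_names{k});
```

If either assertion fails, the indexing convention differs from the one assumed here and the snippet should be adapted accordingly.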

>> sample.E4

ans = 

  struct with fields:

              allsubj: [16×1 double] - only 16 of the subjects in challenge study had viable E4 data (See [1])
     inoculation_time: [16×1 double] - the time of inoculation for each subject
             E4_names: {4×1 cell} - the names of the 4 signals measured from the E4
    E4_variable_names: {15×1 cell} - the feature variables extracted from the E4 (Using the protocol of [1])
              E4_data: {16×1 cell} - for each subject in allsubj (indexed by row), the 15 feature values extracted over 10-minute time segments



>> sample.titer

ans = 

  struct with fields:

        allsubj: [18×1 double] - the indices of the 18 subjects
        alltime: [21×1 double] - the sampling times (excluding 480 hours)
    titer_score: [18×21 double] - the shedding titers interpolated (nearest neighbor) to the 21 sample times
     subj_order: [18×1 double] - the ordering of subjects in decreasing order of accumulated shedding over the entire study 


>> sample.symptom

ans = 

  struct with fields:

          allsubj: [18×1 double] - the indices of the 18 subjects
          alltime: [14×1 double] - the times that the self reporting symptom diaries were filled out  
    symptom_score: [18×14×8 double] - the raw reported symptoms along 8 symptom categories with scores from 0-5 (0 no symptom)
        score_sum: [18×14 double] - the modified Jackson score over the 8 symptoms
    symptom_names: {8×1 cell} - the names of each of the symptoms subjects were asked to score
       subj_order: [18×1 double] - the ordering of subjects in decreasing order of accumulated modified Jackson score over the entire study
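
The relation between symptom_score and score_sum can be checked with a short sketch. It assumes that score_sum is the plain sum of the 8 category scores, which the README does not state explicitly, so the snippet only reports the largest difference rather than asserting equality. Whether subj_order lists row positions or subject IDs is also not stated, so the snippet prints its own ordering instead of comparing against it.

```matlab
sample = jsondecode(fileread('sample.json'));
sym = sample.symptom;

% Sum the 8 symptom categories (3rd dimension) for each subject and diary time.
recomputed = sum(sym.symptom_score, 3);            % 18x14

% Assumption: score_sum is the plain sum of the 8 category scores.
max_diff = max(abs(recomputed(:) - sym.score_sum(:)));
fprintf('Largest difference from stored score_sum: %g\n', max_diff);

% Accumulated score per subject over the whole study, largest first.
acc_score = sum(sym.score_sum, 2, 'omitnan');
[~, order] = sort(acc_score, 'descend');
disp(sym.allsubj(order));
```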


>> sample.lumos

ans = 

  struct with fields:

    innoculation_time: [18×1 double] - the inoculation times for each subject
              allsubj: [18×1 double] - the subject indices
          lumos_names: {4×1 cell} - the names of the 4 principal categories of 18 NCPT scores extracted from the Lumos cognitive test (using protocol in [2])
                 data: {18×1 cell} - the NCPT data for the 18 subjects measured as 18 NCPT variables over a number of testing sessions (ranging from 18 to 24)
             rownames: {19×1 cell} - the names of the lumos variables (first row is time and 18 remaining rows are NCPT variable names) 
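
A minimal sketch of how one subject's Lumos matrix might be split into its time row and NCPT rows is given below. It assumes that the rows of each data{s} matrix follow rownames (row 1 is time, rows 2 to 19 are NCPT variables) and that columns are testing sessions; the assertion stops the script if that layout does not hold. The per-variable standard deviation is only a simple illustrative summary, not the analysis of any of the cited papers.

```matlab
sample = jsondecode(fileread('sample.json'));
lum = sample.lumos;

s = 1;                 % hypothetical choice: first subject in lum.allsubj
X = lum.data{s};

% Assumption: rows of X follow lum.rownames (row 1 = time, rows 2-19 = NCPT
% variables) and columns are testing sessions.
assert(size(X, 1) == numel(lum.rownames), 'Unexpected layout of lumos data');

session_time = X(1, :);            % one time value per testing session
ncpt         = X(2:end, :);        % 18 NCPT variables x sessions
ncpt_names   = lum.rownames(2:end);

fprintf('Subject %g: %d sessions, %d NCPT variables\n', ...
        lum.allsubj(s), size(ncpt, 2), size(ncpt, 1));

% Spread of each NCPT variable across sessions (standard deviation, NaNs ignored).
ncpt_sd = std(ncpt, 0, 2, 'omitnan');
disp(table(ncpt_names(:), ncpt_sd, 'VariableNames', {'Variable', 'SD'}));
```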

>> sample.vafs

ans = 

  struct with fields:

           vafs_raw: [462×4 table] - Raw Visual Analog Fatigue Scale (VAFS) scores (a 462×4 table before encoding; after jsondecode it is the 462×1 struct array shown below)
    vafs_scores_pre: [18×2 double] - VAFS averaged over the pre-inoculation period. First column is subject index and second col is average VAFS. 

>> sample.vafs.vafs_raw

ans = 

  462×1 struct array with fields:

    Subject - the subject index
    Day - the day (day zero is the inoculation day - day 4)
    Time - the time of day (AM, PM1, PM2)
    VAFS - the VAFS score for the particular subject, day, and time
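
The struct array can be turned back into a table for easier grouping, as in the following sketch. It assumes that the Subject and VAFS fields decode as numeric scalars in every record (a missing value in the json file would change the column type). Note that it averages over all records of a subject, so the result is not expected to match vafs_scores_pre, which covers only the pre-inoculation period.

```matlab
sample = jsondecode(fileread('sample.json'));

% Convert the 462x1 struct array into a table with one row per record.
T = struct2table(sample.vafs.vafs_raw);

% Assumption: Subject and VAFS decode as numeric scalars in every record.
[g, subj] = findgroups(T.Subject);
mean_vafs = splitapply(@(x) mean(x, 'omitnan'), T.VAFS, g);

% One row per subject with the mean VAFS over all of that subject's records.
disp(table(subj, mean_vafs, 'VariableNames', {'Subject', 'MeanVAFS'}));
```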

   
----------genematrix_TPM.csv------------------------------ 
This comma-separated matrix is a TPM-normalized matrix of RNAseq read counts extracted from 790 paired-end FASTQ files corresponding to 395 blood samples drawn from 18 subjects over 8 days, approximately 3 times per day. The columns of the matrix are indexed by the subject ID, day (DayMinus3 to Day 4), and time (AM, PM1, PM2) of the blood sample collected in the challenge study experiment (described in [2]-[4]). The reads in each FASTQ file were aligned and mapped to the reference genome Homo_sapiens.GRCh38.84 using HISAT2 2.0.4 software.
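
A hedged sketch for a first look at the matrix in MATLAB follows. The exact header layout of the csv file is not described here, so the snippet assumes a single header row of sample labels, rows corresponding to genes, and possibly one or more leading text columns with gene identifiers; if the file instead uses several header rows, the import options would need to be adjusted. The column-sum check relies on the general property that TPM values sum to 1e6 within a sample when all genes are included.

```matlab
% Assumption: one header row of sample labels; rows are genes; any gene
% identifier columns are text and are dropped by the numeric filter below.
T = readtable('genematrix_TPM.csv');

% Keep only the numeric columns (assumed: one per blood sample).
N = T(:, vartype('numeric'));
X = table2array(N);
fprintf('%d rows x %d numeric columns\n', size(X, 1), size(X, 2));

% TPM values sum to 1e6 within a sample when all genes are included.
col_sums = sum(X, 1, 'omitnan');
fprintf('Column sums range from %.4g to %.4g\n', min(col_sums), max(col_sums));

% A common log transform for downstream analysis (illustrative choice).
logX = log2(X + 1);
```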

-------------------------------------------------------------
 

References
 
[1] X She, Y Zhai, R Henao, CW Woods, C Chiu, Geoffrey S. Ginsburg, Peter X.K. Song, AO Hero, “Adaptive multi-channel event segmentation and feature extraction for monitoring health outcomes,” IEEE Transactions on Biomedical Engineering, vol. 68, no. 8, pp. 2377-2388, Aug. 2021, doi: 10.1109/TBME.2020.3038652. Available on arXiv:2008.09215.

[2] Emilia Grzesiak, Brinnae Bent, Micah T. McClain, Christopher W. Woods, Ephraim L. Tsalik, Bradly P. Nicholson, Timothy Veldman, Thomas W. Burke, Zoe Gardener, Emma Bergstrom, Ronald B. Turner, Christopher Chiu, P. Murali Doraiswamy, Alfred Hero, Ricardo Henao, Geoffrey S. Ginsburg, Jessilyn Dunn, “Assessment of the Feasibility of Using Noninvasive Wearable Biometric Monitoring Sensors to Detect Influenza and the Common Cold Before Symptom Onset,” JAMA Netw Open. 2021;4(9):e2128534. doi: 10.1001/jamanetworkopen.2021.28534.

[3] E Sabeti, S Oh, PX Song, A Hero, “A Pattern Dictionary Method for Anomaly Detection,” Entropy, vol. 24, p. 1095, Aug. 2022, doi: 10.3390/e24081095.

[4] Y. Zhai, P. Murali Doraiswamy, Christopher W. Woods, Ronald B. Turner, Thomas W. Burke, Geoffrey S. Ginsburg, Alfred Hero, “Pre-exposure cognitive performance variability is associated with severity of respiratory infection,” submitted, in second-round review, 2022.