Date: 16 September 2021

Title of related publication: Unbiased inference of the fitness landscape ruggedness from imprecise fitness estimates

Authors: Siliang Song and Jianzhi Zhang

Contact: Siliang Song siliangs@umich.edu

Abstract:
Fitness landscapes map genotypes to their corresponding fitness under given environments and allow explaining and predicting evolutionary trajectories. Of particular interest is the landscape ruggedness or the unevenness of the landscape, because it impacts many aspects of evolution such as the likelihood that a population is trapped in a local fitness peak. Although the ruggedness has been inferred from a number of empirically mapped fitness landscapes, it is unclear to what extent this inference is affected by fitness estimation error, which is inevitable in the experimental determination of fitness landscapes. Here we address this question by simulating fitness landscapes under various theoretical models, with or without fitness estimation error. We find that all eight examined measures of landscape ruggedness are overestimated due to imprecise fitness quantification, but different measures are affected to different degrees. We devise a method to use replicate fitness measures to correct this bias and show that our method performs well under realistic conditions. We conclude that previously reported fitness landscape ruggedness is likely upward biased owing to the negligence of fitness estimation error and advise that future fitness landscape mapping should include at least three biological replicates to permit an unbiased inference of the ruggedness.

Overview of data:
The data contain raw and simulated fitness landscape data, as well as summary data used for plotting all figures in the paper.

Data-specific Description:

./2_Effect_of_Measurement_Error/plot_data/*.pkl
Each .pkl file contains summary data about the effect of measurement error for a specific model of theoretical landscape (NK, Polynomial, RMF), a number of variable sites (5, 10, 15), and a ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, adaptwalk_probs, adptwalk_steps). Files are used to generate Figs. 2, S1, S2, S3.

./3_Ruggedness_Error_Curve/plot_data/*.pkl
Each .pkl file contains summary data about the ruggedness-error curve for a specific model of theoretical landscape (NK, Polynomial, RMF), a number of variable sites (5, 10, 15), and a ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, adaptwalk_probs, adptwalk_steps). Files are used to generate Figs. 3, S4.

./4_Extrapolation_Evaluation/raw_data/*_raw.pkl
Each .pkl file contains the raw predictions of 8 extrapolation methods for a specific model of theoretical landscape (NK, Polynomial, RMF), a number of variable sites (5, 10, 15), and a ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, adaptwalk_probs, adptwalk_steps).

./4_Extrapolation_Evaluation/extrapolation_model_selection_result.pkl
The file contains summary results on the performance of 8 extrapolation methods. It is generated by evaluating the prediction results recorded in the .pkl files in ./4_Extrapolation_Evaluation/raw_data/. The file is used to generate Figs. 4, S5-11.

./5_Empirical_Extrapolation/SD_seq/SD_seq_arti_data.csv
The file contains raw genotype-expression data of SD sequences in E. coli (Kuo et al., 2020).

./5_Empirical_Extrapolation/SD_seq/*_plot.pkl
Each .pkl file contains summary data of the extrapolation results on SD sequence landscapes. Files are used to generate Figs. 5, S12.

./5_Empirical_Extrapolation/trna_Domingo/trna_Domingo_data.csv
The file contains raw genotype-fitness data of a yeast tRNA (Domingo et al., 2018).

./5_Empirical_Extrapolation/trna_Domingo/*_plot.pkl
Each .pkl file contains summary data of the extrapolation results on Domingo et al.'s tRNA landscapes. Files are used to generate Figs. 5, S12.

./5_Empirical_Extrapolation/trna_Li/All_data_df.pkl
The file contains raw genotype-fitness data of a yeast tRNA (Li et al., 2016).

./5_Empirical_Extrapolation/trna_Li/*_plot.pkl
Each .pkl file contains summary data of the extrapolation results on Li et al.'s tRNA landscapes. Files are used to generate Figs. 5, S12.

./6_Model_Parameters_Effect/FL_stratified/*_stratified.pkl
Each .pkl file contains parameter-stratified simulated fitness landscape data for one of three theoretical fitness landscape models (NK, Polynomial, RMF).

./6_Model_Parameters_Effect/plot_df_data/*_plot_df.pkl
Each .pkl file contains ruggedness data for the simulated landscapes in ./6_Model_Parameters_Effect/FL_stratified/. Four ruggedness measures (N_max, epi, r_s, open_ratio) are considered in separate files. Files are used to generate Fig. S14.

./FL_data_3X10/*_landscape_3X10.pkl
Each .pkl file contains 3X10 ruggedness-stratified simulated fitness landscapes with three ruggedness levels (low, middle, high). Fitness landscapes in each file are simulated with a specific theoretical landscape model (NK, Polynomial, RMF) and a number of variable sites (5, 10, 15), and are stratified by a specific ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, adaptwalk_probs, adptwalk_steps). Files are used for drawing the ruggedness-error curve (Figs. 3, S4) and for extrapolation evaluation (Figs. 4, S5-11).

./FL_data_100X10/*_landscape_list_100X10.pkl
Each .pkl file contains 100X10 simulated fitness landscapes. Fitness landscapes in each file are simulated with a specific theoretical landscape model (NK, Polynomial, RMF) and a number of variable sites (5, 10, 15). Files are used for evaluating the effect of measurement error on ruggedness inference (Figs. 2, S1-3).

./index_file/*.pkl
Pre-calculated index files that speed up the ruggedness calculation.

Data-specific Methodology:
The code processing the data can be found at https://github.com/song88180/fitness-landscape-error

All .pkl files are generated using pickle 4.0 with Python 3.8, Jupyter Notebook 6.3.0, Anaconda 4.10.3.

Python objects (list, numpy array, or pandas dataframe) are written to .pkl files by:

    with open('./path/to/file', 'wb') as f:
        pickle.dump(object, f)

Data can be loaded from .pkl files into Python by:

    with open('./path/to/file', 'rb') as f:
        object = pickle.load(f)

Sources referenced:
Kuo, S. T., Jahn, R. L., Cheng, Y. J., Chen, Y. L., Lee, Y. J., Hollfelder, F., ... & Chou, H. H. D. (2020). Global fitness landscapes of the Shine-Dalgarno sequence. Genome Research, 30(5), 711-723.
Domingo, J., Diss, G., & Lehner, B. (2018). Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature, 558(7708), 117-121.
Li, C., Qian, W., Maclean, C. J., & Zhang, J. (2016). The fitness landscape of a tRNA gene. Science, 352(6287), 837-840.

Use and Access:
These data are made available under a Creative Commons Attribution Non-Commercial license (CC BY-NC 4.0).

To cite data:
--For the related publication:
Song, S., and J. Zhang (2021) Unbiased inference of the fitness landscape ruggedness from imprecise fitness estimates. Evolution, in press.
--For the dataset:
Song, S., and J. Zhang (2021) Unbiased inference of the fitness landscape ruggedness from imprecise fitness estimates [Data set]. University of Michigan Deep Blue Data Repository. https://doi.org/10.7302/0kzc-az82
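
Example of reading a .pkl file:
As a supplement to the Data-specific Methodology, the following is a minimal sketch of how one might load a .pkl file and check what kind of object it holds before using it. The helper names load_pkl and describe are hypothetical and are not part of the published code. The demonstration round-trips a small list through a temporary file so that it runs without the dataset; the internal structure of the actual data files is not assumed here.

```python
import pickle
import tempfile
from pathlib import Path


def load_pkl(path):
    """Load one pickled Python object from `path` (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return the object's type name, plus its length when it has one."""
    name = type(obj).__name__
    try:
        return f"{name} (len={len(obj)})"
    except TypeError:
        # Objects without a length (e.g. a plain number) get the type name only.
        return name


# Round-trip demonstration on a temporary file, so the sketch runs
# without the dataset being present.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump([0.1, 0.2, 0.3], f, protocol=4)
    demo = load_pkl(demo_path)
    summary = describe(demo)

print(summary)
```

To inspect a file from this dataset, pass its path to load_pkl in place of demo_path. Files that hold numpy arrays or pandas dataframes can only be unpickled when those packages are installed, and pickle files should only be loaded from sources you trust.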