Date: 16 September 2021

Title of related publication: Unbiased inference of the fitness landscape ruggedness from 
imprecise fitness estimates

Authors: Siliang Song and Jianzhi Zhang
Contact: Siliang Song siliangs@umich.edu

Abstract: 
Fitness landscapes map genotypes to their corresponding fitness under given environments 
and allow explaining and predicting evolutionary trajectories.  Of particular interest is 
the landscape ruggedness or the unevenness of the landscape, because it impacts many 
aspects of evolution such as the likelihood that a population is trapped in a local 
fitness peak.  Although the ruggedness has been inferred from a number of empirically 
mapped fitness landscapes, it is unclear to what extent this inference is affected by 
fitness estimation error, which is inevitable in the experimental determination of fitness 
landscapes.  Here we address this question by simulating fitness landscapes under various 
theoretical models, with or without fitness estimation error.  We find that all eight 
examined measures of landscape ruggedness are overestimated due to imprecise fitness 
quantification, but different measures are affected to different degrees.  We devise a 
method to use replicate fitness measures to correct this bias and show that our method 
performs well under realistic conditions.  We conclude that previously reported fitness 
landscape ruggedness is likely upward biased owing to the negligence of fitness estimation 
error and advise that future fitness landscape mapping should include at least three 
biological replicates to permit an unbiased inference of the ruggedness. 

Overview of data: 
The data contain raw and simulated fitness landscape data, as well as summary data used for 
plotting all figures in the paper.

Data-specific Description:
./2_Effect_of_Measurement_Error/plot_data/*.pkl
	Each .pkl file contains summary data about the effect of measurement error for a
	specific model of theoretical landscape (NK, Polynomial, RMF), a number of variable 
	sites (5, 10, 15), and a ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, 
	adaptwalk_probs, adptwalk_steps). Files are used to generate Figs. 2, S1, S2, S3.

./3_Ruggedness_Error_Curve/plot_data/*.pkl
	Each .pkl file contains summary data about the ruggedness-error curve for 
	a specific model of theoretical landscape (NK, Polynomial, RMF), a number of variable 
	sites (5, 10, 15), and a ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, 
	adaptwalk_probs, adptwalk_steps). Files are used to generate Figs. 3, S4.


./4_Extrapolation_Evaluation/raw_data/*_raw.pkl
	Each .pkl file contains raw results about the prediction of 8 extrapolation methods 
	for a specific model of theoretical landscape (NK, Polynomial, RMF), a number of variable 
	sites (5, 10, 15), and a ruggedness measure (N_max, epi, r_s, open_ratio, E, gamma, 
	adaptwalk_probs, adptwalk_steps).

./4_Extrapolation_Evaluation/extrapolation_model_selection_result.pkl
	The file contains summary results about the performance of 8 extrapolation methods, 
	and is generated by evaluating the prediction results recorded in .pkl files in 
	./4_Extrapolation_Evaluation/raw_data/. The file is used to generate Figs. 4, S5-11.

./5_Empirical_Extrapolation/SD_seq/SD_seq_arti_data.csv
	The file contains raw genotype-expression data of SD sequences in E. coli 
	(Kuo et al., 2020).

./5_Empirical_Extrapolation/SD_seq/*_plot.pkl
	Each .pkl file contains summary data of the extrapolation results on SD sequence 
	landscapes. Files are used to generate Figs. 5, S12.

./5_Empirical_Extrapolation/trna_Domingo/trna_Domingo_data.csv
	The file contains raw genotype-fitness data of a yeast tRNA (Domingo et al., 2018).

./5_Empirical_Extrapolation/trna_Domingo/*_plot.pkl
	Each .pkl file contains summary data of the extrapolation results on Domingo et al.'s 
	tRNA landscapes. Files are used to generate Figs. 5, S12.

./5_Empirical_Extrapolation/trna_Li/All_data_df.pkl
	The file contains raw genotype-fitness data of a yeast tRNA (Li et al., 2016).

./5_Empirical_Extrapolation/trna_Li/*_plot.pkl
	Each .pkl file contains summary data of the extrapolation results on Li et al.'s tRNA 
	landscapes. Files are used to generate Figs. 5, S12.

./6_Model_Parameters_Effect/FL_stratified/*_stratified.pkl
	Each .pkl file contains parameter-stratified simulated fitness landscape data for one 
	of three theoretical fitness landscape models (NK, Polynomial, RMF).

./6_Model_Parameters_Effect/plot_df_data/*_plot_df.pkl
	Each .pkl file contains ruggedness data for the simulated landscapes in 
	./6_Model_Parameters_Effect/FL_stratified/. 
	Four ruggedness measures (N_max, epi, r_s, open_ratio) are considered in separate files.
	Files are used to generate Fig. S14.

./FL_data_3X10/*_landscape_3X10.pkl
	Each .pkl file contains 3X10 ruggedness-stratified simulated fitness landscapes with three 
	ruggedness levels (low, middle, high). Fitness landscapes in each file are simulated by a 
	specific theoretical landscape model (NK, Polynomial, RMF) with a number of variable sites 
	(5, 10, 15), and are stratified by a specific ruggedness measure (N_max, epi, r_s, open_ratio, 
	E, gamma, adaptwalk_probs, adptwalk_steps). Files are used for drawing ruggedness-error 
	curves (Figs. 3, S4) and for extrapolation evaluation (Figs. 4, S5-11).

./FL_data_100X10/*_landscape_list_100X10.pkl
	Each .pkl file contains 100X10 simulated fitness landscapes. Fitness landscapes in each 
	file are simulated by a specific theoretical landscape model (NK, Polynomial, RMF) and a number 
	of variable sites (5, 10, 15). Files are used for evaluating the effect of measurement 
	error on ruggedness inference (Figs. 2, S1-3).

./index_file/*.pkl
	Pre-calculated index files that help improve speed of ruggedness calculation.


Data-specific Methodology:
The code processing the data can be found at https://github.com/song88180/fitness-landscape-error
All .pkl files were generated using pickle 4.0 with Python 3.8, Jupyter Notebook 6.3.0, and Anaconda 4.10.3.
Python objects (lists, numpy arrays, or pandas DataFrames) are written to .pkl files by:
	import pickle
	with open('./path/to/file', 'wb') as f:
		pickle.dump(obj, f)
Data can be loaded from .pkl files into Python by:
	import pickle
	with open('./path/to/file', 'rb') as f:
		obj = pickle.load(f)
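
Because a .pkl file may hold a list, a numpy array, or a pandas DataFrame, it can help 
to check what a file contains before using it. The following is a minimal sketch, not 
part of the published code: the helper names load_pickle, describe, and describe_all are 
hypothetical, and the demo writes a throwaway file so that it runs without the dataset. 
Note that pickle.load can execute arbitrary code, so it should only be used on files 
from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one object from a .pkl file (only use on files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return a short summary: type name plus shape or length when available."""
    name = type(obj).__name__
    if hasattr(obj, "shape"):        # numpy arrays and pandas DataFrames
        return f"{name} with shape {tuple(obj.shape)}"
    if hasattr(obj, "__len__"):      # lists, tuples, dicts
        return f"{name} with {len(obj)} items"
    return name


def describe_all(pattern, root="."):
    """Map each file under root that matches a glob pattern to its summary."""
    paths = sorted(Path(root).glob(pattern))
    return {p.name: describe(load_pickle(p)) for p in paths}


# Self-contained demo on a temporary file (hypothetical content, not real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump([0.1, 0.2, 0.3], f, protocol=4)
    demo_summary = describe_all("*.pkl", root=tmp)

print(demo_summary)
```

With the dataset unpacked in the working directory, a call such as 
describe_all("index_file/*.pkl") would summarize the files in that folder. Loading files 
that contain numpy arrays or pandas DataFrames requires the corresponding package to be 
installed.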


Sources referenced:
Kuo, S. T., Jahn, R. L., Cheng, Y. J., Chen, Y. L., Lee, Y. J., Hollfelder, F., ... & Chou, H. H. D. (2020). Global fitness landscapes of the Shine-Dalgarno sequence. Genome Research, 30(5), 711-723.

Domingo, J., Diss, G., & Lehner, B. (2018). Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature, 558(7708), 117-121.

Li, C., Qian, W., Maclean, C. J., & Zhang, J. (2016). The fitness landscape of a tRNA gene. Science, 352(6287), 837-840.


Use and Access: 
These data are made available under a Creative Commons Attribution Non-Commercial license 
(CC BY-NC 4.0).

To cite data:

 --For the related publication: 
 
Song, S., and J. Zhang (2021) Unbiased inference of the fitness landscape ruggedness from 
imprecise fitness estimates. Evolution, in press.

 --For the dataset: 
 
Song, S., and J. Zhang (2021) Unbiased inference of the fitness landscape ruggedness from 
imprecise fitness estimates [Data set]. University of Michigan Deep Blue Data Repository. https://doi.org/10.7302/0kzc-az82