Figure 8 | Microbial Informatics and Experimentation

Figure 8

From: Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli

Performance of a multiparameter model predicting protein usability. The model employs the significant sequence parameters that were retained after stepwise multiple binary logistic regression and that also fulfill the Akaike Information Criterion (see Methods section). The probability of yielding a protein with E*S > 11 is given by an equation of the form p = 1/(1 + exp(-θ)), where θ is a linear combination of the significant sequence parameters. This equation models the results in the Analysis Dataset closely up to a 65% probability of protein usability (p = 3.7 × 10^-111, N = 7733) and performs similarly well on a Test Dataset comprising 1911 proteins chosen at random to be excluded from the Analysis Dataset (p = 6.8 × 10^-16, N = 1911, θ' = 0.85*θ - 0.06). The graph shows the performance of the model based on ten bins at equal intervals of 0.1 in the variable θ. The squares show the fraction of usable proteins in each bin, and the error bars represent 95% confidence limits calculated from counting statistics.
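
The quantities in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the coefficients that define θ are not given in the caption and are therefore not reproduced, and the 95% limits use a normal approximation to the binomial, which is only one possible reading of "counting statistics".

```python
import math


def usability_probability(theta):
    """Logistic transform from the caption: p = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))


def rescale_theta(theta, slope=0.85, intercept=-0.06):
    """Relation quoted for the Test Dataset: theta' = 0.85*theta - 0.06."""
    return slope * theta + intercept


def bin_fraction_with_limits(n_usable, n_total, z=1.96):
    """Fraction of usable proteins in one bin with approximate 95% limits.

    Assumption: normal approximation to the binomial; the caption does not
    state which counting-statistics formula was used.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    fraction = n_usable / n_total
    half_width = z * math.sqrt(fraction * (1.0 - fraction) / n_total)
    lower = max(0.0, fraction - half_width)
    upper = min(1.0, fraction + half_width)
    return fraction, lower, upper
```

For example, `usability_probability(0.0)` returns 0.5, and a hypothetical bin with 50 usable proteins out of 100 gives a fraction of 0.5 with limits of roughly 0.40 and 0.60 under this approximation.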
