1. 20 Mar, 2021 6 commits
• tk/Mutsel_simulator_cpg: simulator with linear time complexity · 7411d86f
Philippe Veber authored
```
now we can generate a million sites within a minute

> df <- data.frame(n = c(10000,30000,100000,300000), t = c(1.08,2.25,6.27,18.03)) ; fit <- lm(t ~ n, data = df) ; summary(fit)

Call:
lm(formula = t ~ n, data = df)

Residuals:
       1        2        3        4
 0.01851  0.01931 -0.05290  0.01509

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.769e-01  2.997e-02   15.91  0.00393 **
n           5.846e-05  1.886e-07  310.00 1.04e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04325 on 2 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1
F-statistic: 9.61e+04 on 1 and 2 DF,  p-value: 1.041e-05
```
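
The commit message does not say how the linear running time is obtained. One common way to remove a per-event linear scan is to keep the per-site rates in a binary sum tree, so that drawing a site and updating a rate both cost O(log n). The sketch below illustrates that general technique only; `SumTree` and its methods are hypothetical names and are not taken from the repository.

```python
import random


class SumTree:
    """Binary tree of partial sums over non-negative weights (illustrative only)."""

    def __init__(self, weights):
        self.size = len(weights)
        self.n = 1
        while self.n < self.size:          # pad the leaf count to a power of two
            self.n *= 2
        self.tree = [0.0] * (2 * self.n)
        for i, w in enumerate(weights):    # leaves live at indices n .. 2n-1
            self.tree[self.n + i] = w
        for k in range(self.n - 1, 0, -1):  # each internal node sums its children
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

    def total(self):
        return self.tree[1]

    def update(self, i, w):
        """Set weight i to w and refresh the sums on the path to the root."""
        k = self.n + i
        self.tree[k] = w
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def find(self, u):
        """Return the index i whose cumulative interval contains u."""
        k = 1
        while k < self.n:
            left = self.tree[2 * k]
            if u < left:
                k = 2 * k
            else:
                u -= left
                k = 2 * k + 1
        return min(k - self.n, self.size - 1)  # guard against rounding at the end

    def sample(self, rng):
        """Draw an index with probability proportional to its weight."""
        return self.find(rng.random() * self.total())


rng = random.Random(0)
tree = SumTree([1.0, 2.0, 3.0, 4.0])
site = tree.sample(rng)   # O(log n) draw
tree.update(site, 0.5)    # O(log n) update after the state change
```

With n sites and a number of events proportional to n, this gives O(n log n) overall, which is hard to tell apart from a straight line over the range of sizes timed above.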
• tk: new Discrete_pd module · 3ca117c0
Philippe Veber authored
• tk/Mutsel_cpg_simulator: avoid recomputing most of the rate vectors · 8ac66b28
Philippe Veber authored
```
only recompute what is affected by the state change at a given
position. Complexity is still quadratic because each sampling step
has to consider all positions, but the constant is about 300 times
smaller than in the previous commit (3.893e-08 vs. 1.212e-05).

> df <- data.frame(n = c(10000,13000,20000,23000,30000), t = c(5.03,7.53,16.84,21.58,36.12)) ; fit <- lm(t ~ I(n ^ 2), data = df) ; summary(fit)

Call:
lm(formula = t ~ I(n^2), data = df)

Residuals:
       1        2        3        4        5
 0.05330 -0.13314  0.18311 -0.09938 -0.00389

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.083e+00  1.161e-01   9.335   0.0026 **
I(n^2)      3.893e-08  2.286e-10 170.301 4.46e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.146 on 3 degrees of freedom
Multiple R-squared:  0.9999,	Adjusted R-squared:  0.9999
F-statistic: 2.9e+04 on 1 and 3 DF,  p-value: 4.464e-07
```
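
The idea in this commit message (refresh only the rates touched by a state change, while still scanning every position to pick the next one) can be sketched as follows. This is a toy illustration under stated assumptions, not the repository's code: the rate rule in `total_rate`, the factor 10, and all function names are hypothetical, and the only property that matters is that a site's rate depends on its immediate neighbours.

```python
import random


def total_rate(seq, i):
    """Toy exit rate of site i; assumed to depend only on sites i-1, i, i+1."""
    rate = 1.0
    # Hypothetical CpG rule: a C followed by G, or a G preceded by C, is 10x faster.
    if seq[i] == "C" and i + 1 < len(seq) and seq[i + 1] == "G":
        rate *= 10.0
    if seq[i] == "G" and i > 0 and seq[i - 1] == "C":
        rate *= 10.0
    return rate


def update_rates(seq, rates, i):
    """After seq[i] changed, refresh only the three rates that can differ."""
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(seq):
            rates[j] = total_rate(seq, j)


def sample_position(rates, rng):
    """Linear scan over all positions: O(n) per draw, hence quadratic overall."""
    u = rng.random() * sum(rates)
    acc = 0.0
    for i, r in enumerate(rates):
        acc += r
        if u < acc:
            return i
    return len(rates) - 1  # guard against floating-point rounding


rng = random.Random(0)
seq = list("ACGTACGGCA")
rates = [total_rate(seq, i) for i in range(len(seq))]

i = sample_position(rates, rng)                           # O(n) scan
seq[i] = rng.choice([b for b in "ACGT" if b != seq[i]])   # state change at i
update_rates(seq, rates, i)                               # O(1) local refresh
```

The local refresh removes the large constant, but `sample_position` still touches every site on each event, which matches the quadratic fit reported in the message.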
• tk/Mutsel_sim_cpg: compute only rate vectors instead of rate matrices · 4b3d32db
Philippe Veber authored
```
quadratic coefficient decreases from 1.671e-05 to 1.212e-05.

> df <- data.frame(n = c(500,1000,1300,2000), t = c(3.62,12.77,20.77,49.07)) ; fit <- lm(t ~ I(n ^ 2), data = df) ; summary(fit)

Call:
lm(formula = t ~ I(n^2), data = df)

Residuals:
       1        2        3        4
 0.05496  0.11786 -0.24227  0.06946

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.360e-01  1.594e-01   3.362   0.0782 .
I(n^2)      1.212e-05  7.145e-08 169.576 3.48e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2005 on 2 degrees of freedom
Multiple R-squared:  0.9999,	Adjusted R-squared:  0.9999
F-statistic: 2.876e+04 on 1 and 2 DF,  p-value: 3.477e-05
```
• tk/Mutsel_simulator_cpg: initial speed assessment · 8f248d0d
Philippe Veber authored
```
using the (debugged) implementation in phylogenetics, run simulations
for 500 to 2000 sites. Quadratic complexity is expected here; to
check it, I take logarithms, which turns

t = K n^2

into

log t = 2 log n + log K

so the slope of log t against log n estimates the exponent.

Running times are only nearly quadratic (fitted slope 1.85 instead of 2):

> df <- data.frame(n = c(500,1000,1300,2000), t = c(5.15,18.48,28.63,68.07)) ; fit <- lm(log2(t) ~ log2(n), data = df) ; summary(fit)

Call:
lm(formula = log2(t) ~ log2(n), data = df)

Residuals:
        1         2         3         4
 0.015095  0.007796 -0.061122  0.038231

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.24278    0.36390  -39.14 0.000652 ***
log2(n)       1.85062    0.03608   51.30 0.000380 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05237 on 2 degrees of freedom
Multiple R-squared:  0.9992,	Adjusted R-squared:  0.9989
F-statistic:  2631 on 1 and 2 DF,  p-value: 0.0003798

> df <- data.frame(n = c(500,1000,1300,2000), t = c(5.15,18.48,28.63,68.07)) ; fit <- lm(t ~ I(n^2), data = df) ; summary(fit)

Call:
lm(formula = t ~ I(n^2), data = df)

Residuals:
      1       2       3       4
-0.1138  0.6815 -0.7004  0.1327

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.086e+00  5.581e-01   1.945 0.191208
I(n^2)      1.671e-05  2.501e-07  66.822 0.000224 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.702 on 2 degrees of freedom
Multiple R-squared:  0.9996,	Adjusted R-squared:  0.9993
F-statistic:  4465 on 1 and 2 DF,  p-value: 0.0002239
```
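
The log-transform check described in the initial speed assessment can be reproduced outside R. The sketch below is a plain ordinary-least-squares fit of log2(t) against log2(n) using only the Python standard library, applied to the four timings quoted in that commit message; `fit_power_law` is a hypothetical helper name introduced here for illustration.

```python
import math


def fit_power_law(ns, ts):
    """Least-squares fit of log2(t) = a * log2(n) + b; returns (a, b)."""
    xs = [math.log2(n) for n in ns]
    ys = [math.log2(t) for t in ts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Timings (sites, seconds) quoted in the commit message.
ns = [500, 1000, 1300, 2000]
ts = [5.15, 18.48, 28.63, 68.07]

exponent, log2_k = fit_power_law(ns, ts)
print(f"exponent = {exponent:.3f}, log2(K) = {log2_k:.3f}")
```

The slope should agree with the `log2(n)` estimate in the R summary (about 1.85) up to rounding. An exponent slightly below 2 is consistent with the remark that the running times are only nearly quadratic over this range.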
