Commit 6c4222e4 authored by Philippe Veber's avatar Philippe Veber
Browse files

started a note on multinomial

parent 943edcc8
...@@ -14,3 +14,4 @@ __pycache__ ...@@ -14,3 +14,4 @@ __pycache__
*.seed *.seed
*.bck *.bck
.vscode/ .vscode/
notes/*.html
...@@ -93,3 +93,6 @@ clean-test: ...@@ -93,3 +93,6 @@ clean-test:
.PHONY: format .PHONY: format
format: format:
ocp-indent -i lib/*.ml lib/*.mli app/*.ml ocp-indent -i lib/*.ml lib/*.mli app/*.ml
notes/%.html: notes/%.md
scirep --libs bistro.ppx,bistro.utils,codepi $< $@
# Multinomial method for detecting convergence
A simple although theoretically wrong method for detecting a
difference in amino-acid composition in relation to a binary feature
of species is to gather amino-acid counts in each group of extant
species, and apply a test against the null hypothesis that the amino
acid composition (or profile) is the same in the two groups.
We investivate here how this simple approach behaves compared to more
sophisticated models that take the species phylogeny into account.
```ocaml
module Top = Bistro_utils.Toplevel_eval.Make(
struct
let np = 8
let mem = 10
end)()
;;
module Show = struct
include Scirep.Show
let png_rplot f = png (fun fn ->
OCamlR_grDevices.png fn ;
f () ;
OCamlR_grDevices.dev_off ()
)
let cached_png_rplot f =
let f = fun%workflow dest ->
OCamlR_grDevices.png dest ;
[%eval f] () ;
OCamlR_grDevices.dev_off ()
in
Bistro.Workflow.path_plugin ~descr:"cached_png_plot" f
|> Top.path
|> png_file
end
;;
open Core;;
open Top;;
open Codepi;;
module Tk = Codepitk;;
module Pipeline = Simulation_pipeline.Mutsel
```
For that purpose we use simulations under the mutsel model.
```ocaml
let n_h0 = 900
let n_ha = 100
let tree_path = "data/besnard2009/besnard2009.nhx"
let tree = Simulation_pipeline.NHX (Bistro.Workflow.input tree_path)
let profiles = "example/aa_fitness/263SelectedProfiles.tsv"
let sim = Pipeline.query ~tree ~profiles ~n_h0 ~n_ha ()
```
```ocaml
let benchmark1 =
Pipeline.(benchmark sim [ multinomial_asymptotic_lrt ; tdg09 ])
;;
let b = Top.eval benchmark1 in
List.zip_exn b.method_labels b.average_precision;;
```
This result suggests that multinomial performs equally well as tdg09
while not taking into account the phylogeny. Surprising.
## p value distribution under non convergent sites
Let's look at those results from a different angle. Since we perform a
test, it is expected to find a uniform distribution for pvalues under
the null hypothesis.
Let's check it:
```ocaml
let benchmark_output (b : Pipeline.benchmark) key =
let assoc = List.zip_exn b.method_labels b.method_outputs in
List.Assoc.find_exn assoc key ~equal:String.equal
;;
Show.png_rplot (fun () ->
let benchmark = eval benchmark1 in
let pvals =
benchmark_output benchmark "Multinomial_1mp"
|> Array.sub ~pos:0 ~len:benchmark.n_h0
|> Array.filter_opt
|> Array.map ~f:(fun x -> 1. -. x)
in
let main = "Multinomial LRT pvalue distribution on non convergent sites" in
ignore (OCamlR_graphics.hist ~xlab:"pvalue" ~main pvals : OCamlR_graphics.hist)
)
;;
```
This is far from uniform, and to be honest no surprise here:
uniformity is expected when the data is generated under the null *of
the test*. Here we are working under a different model. So maybe, just
to be sure, we can check that our test pvalues are uniform under the
test's null.
## p value distribution under the test's null hypothesis
LRT's null when simulating under dense profiles looks indeed uniform:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test ~alpha:1. ~k:5 ~n1:200 ~n2:200 LRT.asymptotic_test
)
;;
```
But it deteriorates for a smaller number of observations:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test ~alpha:1. ~k:5 ~n1:20 ~n2:20 LRT.asymptotic_test
)
;;
```
Or when increasing the number of categories:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test ~alpha:1. ~k:20 ~n1:200 ~n2:200 LRT.asymptotic_test
)
;;
```
Or for sparse probability distribution:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test ~alpha:0.1 ~k:5 ~n1:200 ~n2:200 LRT.asymptotic_test
)
;;
```
This last problem can be partially fixed by adjusting the degree of
freedom of the LRT to the set of categories that are effectively
observed:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test ~alpha:0.1 ~k:5 ~n1:200 ~n2:200 (LRT.asymptotic_test ~df_correction:true)
)
;;
```
Part of high pvalues are caused by sites with only a single observed
amino-acid (which lead to pvalues of 1). Once we remove them we
obtain:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test
~filter_out_single_category:true ~alpha:0.1
~k:5 ~n1:200 ~n2:200
(LRT.asymptotic_test ~df_correction:true)
)
;;
```
In the end, with parameter values compatible with our setting, with
the two adjustments (heuristic for degree of freedom and filter out
trivial cases), we go from:
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test
~alpha:0.1
~k:20 ~n1:20 ~n2:20
(LRT.asymptotic_test ~df_correction:false)
)
;;
```
to
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test
~filter_out_single_category:true
~alpha:0.1 ~k:20
~n1:20 ~n2:20
(LRT.asymptotic_test ~df_correction:true)
)
;;
```
Resorting to a permutation test is a lot better (but also a lot more
costly):
```ocaml
Show.cached_png_rplot (fun%workflow () ->
let open Tk.Multinomial_test in
uniformity_test
~filter_out_single_category:true ~alpha:0.1
~k:20 ~n1:20 ~n2:20
(Permutation.test ~statistic:LRT.likelihood_log_ratio)
)
;;
```
## Simulating under tdg09's null distribution
(executable
(name codepi_multinomial_note)
(libraries codepi scirep)
(modules Codepi_multinomial_note)
(preprocess (pps bistro.ppx)))
(rule
(target codepi_multinomial_note.ml)
(deps codepi_multinomial_note.md)
(action (with-stdout-to %{target} (run ocaml-mdx pp %{deps}))))
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment