Do not normalise NgramsTerms inside a patchmap
While investigating #631 (closed), I tracked the problem (or at least part of it) down to the way ngrams terms are normalised in the frontend before a PatchTable is created.
I did some investigation on #631, and I wasn't able to replicate the issue with "normal" flow tests that add elements as map terms.
However, on the frontend I was able to replicate the issue with things like home—brew, where — is the Unicode glyph for "em dash" (https://www.compart.com/en/unicode/U+2014).
If I look at the payload that the frontend sends to the backend, I can see that the data has already been cleaned up in an undesirable way (see Elements/Matrix).
As you can see from the "Request Payload", we are sending "home brew" to the backend, without the dash.
I think the culprit might be coming from:
normNgramInternal :: CTabNgramType -> String -> String
normNgramInternal CTabAuthors = identity
normNgramInternal CTabSources = identity
normNgramInternal CTabInstitutes = identity
normNgramInternal CTabTerms = {- GS.specialCharNormalize
  <<< -} S.toLower
  <<< R.replace wordBoundaryReg " "
This uses a regex to normalise the ngrams term before building the patch to be sent to the backend server. I think the frontend shouldn't do any normalisation here, but rather submit the content as-is to the backend, which has all the computation power and knowledge to normalise things in a sane way.
This function is used inside highlightNgrams, but I wouldn't be surprised if it is also used when building the patchmap.
> FN.normNgram CTabTerms "home—brew"
(NormNgramsTerm "home brew")
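To make the information loss concrete, here is a minimal, self-contained sketch of the same pipeline shape as the CTabTerms branch. The real definition of wordBoundaryReg is not shown in this issue, so boundaryLike below is a hypothetical stand-in (it treats any run of non-ASCII-alphanumeric characters as a boundary), and the module and function names are made up for illustration.

```purescript
module Example.LossyNorm where

import Prelude

import Data.String (toLower)
import Data.String.Regex (Regex, replace)
import Data.String.Regex.Flags (global)
import Data.String.Regex.Unsafe (unsafeRegex)

-- ASSUMPTION: hypothetical stand-in for wordBoundaryReg, whose actual
-- definition is not part of this issue. It matches any run of characters
-- that are not ASCII letters or digits, which includes the em dash (U+2014).
boundaryLike :: Regex
boundaryLike = unsafeRegex "[^A-Za-z0-9]+" global

-- Same shape as the CTabTerms branch: replace every match with a single
-- space, then lowercase the result.
lossyNorm :: String -> String
lossyNorm = toLower <<< replace boundaryLike " "

-- The em dash variant and the plain-space variant normalise to the same
-- string, so the original term cannot be recovered from what gets sent.
collapses :: Boolean
collapses = lossyNorm "home—brew" == lossyNorm "home brew"
```

Under that assumption, collapses evaluates to true: once the frontend has normalised the term, the backend can no longer tell which of the two spellings the user actually typed.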
Alas, I cannot pull from or push to the repo, so I'm kinda stuck today, but the quickest fix would be to remove the call to normNgram from:
performAction
  $ CoreAction
  $ addNewNgramA (normNgram tabNgramType sq) MapTerm
This code lives in Gargantext.Components.NgramsTable and is responsible for adding new map terms in the maplist UI.
However, the morally correct fix would be, as I said, to remove normNgram altogether and never normalise ngrams in the frontend, even if that means the highlighting in the document looks weird; at least that would be consistent with the information the backend sees.