Synonyms

This is the documentation for 3.0.2 version, which is not the latest version. Consider upgrading to 4.3.0.

Use synonyms to vary your output is a good practice. For some background about the approaches read Ehud Reiter’s article about synonyms in NLG.

You can output simple synonyms (words etc.) with syn and complex ones (which use mixins or other synonyms etc.) using synz > syn structure.

The algorithm that chooses the synonym to output works like that:

  • It is random based (nothing fancy but efficient), or sequence based (one after the other), depending on the mode.

  • It eliminates empty alternatives.

  • You can ask the algorithm to globally choose the best alternative.

You should not use your own random numbers in your mixins, because it will break RosaeNLG’s ability to predict the next outputs. More about RosaeNLG and random numbers.

Basic synonyms using syn

The syn mixin is perfect for very basic synonyms.

Arguments can be single words, multiple words or anything, but not mixins. Please note that the argument is not an array: the mixin takes a variable number of arguments.

With this mixin the choice is always random. Use synz > syn structure if you want more options like sequential output.

The syn_fct function

The syn_fct is not a mixin but a standard JavaScript function. Its argument is an array. It is useful when you want random arguments in some other mixins. Remind, do not use you own random function.

Will randomly output the apple or an apple:

Complex synonyms with synz > syn structure

First example

When each synonymic alternative is complex text using mixins, syn doesn’t fit. You have to use the synz > syn structure.

You can put sentences or words or whatever you want in each syn.

Note on empty alternatives

RosaeNLG will always try to find a non-empty alternative in a synz > syn structure. When it triggers an empty one, it will go back and try to find a new one.

Choose randomly but try not to repeat

Pure random mode is default but has a drawback: the same alternative can trigger again, leading to non harmonious text. To avoid that use mode:'once'. It will trigger each alternative randomly, but will try not to repeat the same alternative. When all alternatives have been triggered, it will reset.

In general you should favor once instead of default random.

Force a specific synonym to trigger

To force a specific synonym to trigger, use synz {force:3} (to trigger the 3rd one):

This is useful while developping.
if the forced alternative is empty, it will not trigger it (and will trigger a non empty one).

Weights of each alternative

If you want to favor an alternative more than the other, you can put a higher weight on the one you prefer:

The the one I prefer option will be triggered much more often (probability is 3/5).

weight must be a strictly positive integer.

It is generally a bad practice to use weight and mode: 'once' in the same structure: once an alternative has been triggered, it will be avoided whatever its weight.

Choose each synonym alternative one after the other

Sometimes random is not the right way. You might prefer to trigger the first alternative, then the second one, etc. Put the mode parameter to sequence to do that.

When called 5 times, will output: first second third fourth first

weight parameter is meaningless in sequence mode.

Global synonym mode

Possible values for more are:

  • random (default)

  • sequence

  • once

By default, the synonyms are choosen randomly (random), and you can locally change this behavior using sequence or once mode. But you can change the behavior globally using defaultSynoMode.

When you have changed defaultSynoMode, you can still change the default behavior locally using another mode.

using once as defaultSynoMode and setting sequence locally is a popular setting.

Choosing the best alternative globally with choosebest

Introduction and first example

The standard synonym algorithm is and should be good enough for most usages. When there are non elegant repetitions in the generated texts, the first reflex should be to do local fixes with using {mode:'sequence'}.

choosebest works the following:

  • it generates dozens of texts on a section, whatever its size or what is contains

  • it chooses the textual alternative that contains the least close repetitions

For instance, if stone gem and jewel are synonyms, ranking from best to worst: stone gem jewel / stone gem stone / stone stone gem / stone stone stone.

Let’s take a first example:

eachz i in [1,2,3] with {separator: ' '}
  synz
    syn
      | stone
    syn
      | jewel
    syn
      | gem

If you run that, you will get randomly gem jewel jewel or stone gem stone etc. - sometimes gem jevel stone if you are lucky.

Let’s use choosebest:

It will generate a 100 times the same text and take the best alternative. Unless you are very unlucky, you are sure to get gem jevel stone (still in a random order).

Usage

You can put choosebest anywhere to optimize synonyms in a section of text but you should use it at a paragraph level.

choosebest has a heavy impact on performance as the texts are generated multiple times. Use it cautiously only when required.
you cannot imbricate choosebest structures. But in a same template you can use multiple choosebest structures one after the other, for instance on each paragraph.

Advanced options

How it works

The scoring algorithm works like this:

  • single words are extracted thanks to a tokenizer wink-tokenizer, and lowercased

  • stopwords are removed (you can customize the list of stopwords)

  • when the same word appears multiples times, it raises the score depending on the distance of the two occurrences (if the occurrences are closes it raises the score a lot).

Max attempt

To indicate the maximum attempts to find the best alternative:

  • among local parameter: choosebest {among:20}

  • defaultAmong global parameter: rosaenlgPug.render(myTemplate, { language: 'en_US', defaultAmong:10 })

  • default is 5

Stop words customization

You can customize locally the list of stop words with:

  • stop_words_add string[]: list of stopwords to add to the standard stopwords list (NB: stop_words_add will be automatically lowercased)

  • stop_words_remove string[]: list of stopwords to remove to the standard stopwords list

  • stop_words_override string[]: replaces the standard stopword list (which is per language)

will output newStopWord newStopWord AAA newStopWord BBB.

choosebest param
  synz
    syn
      | thus thus thus AAA BBB
    syn
      | AAA AAA

will output AAA AAA, because thus is not considered as a stop word no more.

The standard list of stop words per language is here.

Force identical elements

Sometimes you want to say that 2 or more words should be considered as identical in terms of synonyms even if they are not. Often for plurals: diamonds diamond, as there is no integrated lemmatizer, or for similar words like phone cellphone smartphone.

Use identicals string[][] with list of words that should be considered as beeing identical:

will output diamonds and pearl systematically.

How to debug

It is often difficult to understand why choosebest has chosen one alternative and not another, what is has explored, the different scores etc. You can activate traces using debug:true and get the result in debugRes:

Generate all alternatives

You may want to generate exhaustively all texts to insure that they are ok.

Even on a modest project, the combination of all possible texts can be huge.

A general advice would be:

  • When the output must be completely predictable and exhaustively tested (it can be the case for financial or medical applications), you may just consider avoid using synonyms, or use them less.

  • Use regression testing and test some of the outputs:

    • some with a fixed random seed (forceRandomSeed): must be an exact match

    • some random, but checking for invariants in the text (typically numbers, or facts)

If you really want to generate all texts, you can write a loop using forceRandomSeed:

  • put forceRandomSeed to 1, then 2, then 3 etc.

  • changing random seed will not guarantee that a different text is generated each time: keep the generated texts as keys in a Map to see if a text is new or not

  • eventually stop looping when you don’t discover enough new texts