Visualizing Language Model Tensors (Embeddings) in TensorFlow's TensorBoard

[TensorBoard Projector: PCA; t-SNE; ...]

Source Persagen.com
Author Dr. Victoria A. Stuart, Ph.D.
Created 2019-11-28
Last modified
Summary Evaluation of biomedical contextual language models, visualization

Background

I have been evaluating some contextual language models for biomedical natural language processing (BioNLP). Several general-use NLP platforms support these models, including spaCy and Flair.

Visualizations

Several of the “general-use” packages mentioned above provide tools for visualizing natural language tags and embeddings. For example, spaCy’s visualizers (notably displaCy) render color-coded entities, as I describe here (I also describe the CRAFT corpus entity labels there).
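For instance, a minimal displaCy sketch (the en_core_web_sm model and the sample sentence are illustrative; any installed spaCy pipeline with an NER component will do):

    import spacy
    from spacy import displacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('Apple is looking at buying U.K. startup for $1 billion.')
    displacy.serve(doc, style='ent')   ## serves color-coded entities at http://localhost:5000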

Likewise, I was intrigued by this example, Visualizing spaCy vectors in TensorBoard, on the spaCy examples page. It’s apparently possible to view those embeddings (tensors) in the TensorFlow Embedding Projector!

I was looking at Flair embeddings at the time (2019-11-27; awaiting the anticipated release of a BioFlair pretrained model), so I thought I’d try to demo the viewing of those embeddings in TensorFlow’s Projector.

Having installed Flair, Torch/PyTorch, TensorFlow, etc. in that Py3.7 venv, I proceeded to figure out how to load the Flair embeddings into the TensorFlow Embedding Projector. The following code provides a step-by-step walkthrough.
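Before the step-by-step details, here is a condensed sketch of the whole pipeline, for orientation: embed a sentence with Flair; convert the per-token PyTorch tensors to a NumPy array; then hand that array (plus token labels) to PyTorch's SummaryWriter.add_embedding(), which writes a Projector-readable run for TensorBoard. (A sketch only: the 'pubmed-*' model names assume those pretrained Flair models download successfully.)

    import numpy as np
    from flair.data import Sentence
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings
    from torch.utils.tensorboard import SummaryWriter

    ## Embed a sentence with stacked forward + backward Flair embeddings:
    sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node.')
    StackedEmbeddings([FlairEmbeddings('pubmed-forward'),
                       FlairEmbeddings('pubmed-backward')]).embed(sentence)

    ## One row per token; one label per row:
    vectors = np.array([token.embedding.tolist() for token in sentence])
    labels = [token.text for token in sentence]

    writer = SummaryWriter()               ## writes to ./runs/<datetime>_<host>
    writer.add_embedding(vectors, labels)  ## view in TensorBoard's Projector tab
    writer.close()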

Flair embeddings (tensors) → TensorFlow TensorBoard Embedding Projector


Install TensorBoard (Py3.7 venv)

    
      # Install Python 3.7 in Python 3.8 env:
      #   https://stackoverflow.com/a/58964629/1904943

      # Test (in terminal):

        [victoria@victoria ~]$ date
          Wed 20 Nov 2019 04:25:38 PM PST

        [victoria@victoria ~]$ p37    ## ~/.bashrc alias
          [Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

        (py3.7) [victoria@victoria ~]$ env | grep -i virtual
          VIRTUAL_ENV=/home/victoria/venv/py3.7
    

    
      (py3.7) [victoria@victoria ~]$ python --version
        Python 3.7.4

      (py3.7) [victoria@victoria ~]$ pip install --upgrade pip
        ...
        Successfully installed pip-19.3.1

      ## https://github.com/lanpa/tensorboardX
      ## Also installs (if I recall) tensorflow, other dependencies:
      (py3.7) [victoria@victoria ~]$ pip install tensorboardX    ## << note: capital X
        ...
        ## If needed:  pip install moviepy

      (py3.7) [victoria@victoria ~]$ pip install flair
        ...
        Successfully installed
          Cython-0.29.14
          SudachiPy-0.4.0
          attrs-19.3.0
          backcall-0.1.0
          boto-2.49.0
          boto3-1.10.23
          botocore-1.13.23
          bpemb-0.3.0
          certifi-2019.9.11
          cffi-1.13.2
          chardet-3.0.4
          click-7.0
          cloudpickle-1.2.2
          cycler-0.10.0
          dartsclone-0.6
          decorator-4.4.1
          deprecated-1.2.7
          docutils-0.15.2
          flair-0.4.4
          future-0.18.2
          gensim-3.8.1
          hyperopt-0.2.2
          idna-2.8
          importlib-metadata-0.23
          ipython-7.6.1
          ipython-genutils-0.2.0
          jedi-0.15.1
          jmespath-0.9.4
          joblib-0.14.0
          kiwisolver-1.1.0
          kytea-0.1.4
          langdetect-1.0.7
          matplotlib-3.1.1
          more-itertools-7.2.0
          mpld3-0.3
          natto-py-0.9.0
          networkx-2.2
          numpy-1.17.4
          packaging-19.2
          parso-0.5.1
          pexpect-4.7.0
          pickleshare-0.7.5
          pillow-6.2.1
          pluggy-0.13.0
          prompt-toolkit-2.0.10
          ptyprocess-0.6.0
          py-1.8.0
          pycparser-2.19
          pygments-2.4.2
          pymongo-3.9.0
          pyparsing-2.4.5
          pytest-5.3.0
          python-dateutil-2.8.1
          regex-2019.11.1
          requests-2.22.0
          s3transfer-0.2.1
          sacremoses-0.0.35
          scikit-learn-0.21.3
          scipy-1.3.2
          segtok-1.5.7
          sentencepiece-0.1.83
          six-1.13.0
          sklearn-0.0
          smart-open-1.9.0
          sortedcontainers-2.1.0
          sqlitedict-1.6.0
          tabulate-0.8.6
          tiny-tokenizer-3.0.1
          torch-1.3.1
          torchvision-0.4.2
          tqdm-4.38.0
          traitlets-4.3.3
          transformers-2.1.1
          urllib3-1.24.3
          wcwidth-0.1.7
          wrapt-1.11.2
          zipp-0.6.0

      (py3.7) [victoria@victoria ~]$ python
        Python 3.7.4 (default, Nov 20 2019, 11:36:53) 
        [GCC 9.2.0] on linux
        Type "help", "copyright", "credits" or "license" for more information.

      >>> import flair    ## works, yea!!  :-D
      >>> 
    
  

Start TensorBoard

    
      [victoria@victoria tensorflow]$ cd /mnt/Vancouver/apps/tensorflow/

      [victoria@victoria tensorflow]$ date; pwd; echo; ls -l

        Thu 28 Nov 2019 10:50:19 AM PST
        /mnt/Vancouver/apps/tensorflow

        total 928
        -rw-------  1 victoria victoria  19305 Nov 28 10:49 _readme-tensorflow-victoria.txt
        drwxr-xr-x 11 victoria victoria   4096 Nov 26 16:45 runs

      [victoria@victoria tensorflow]$ tensorboard --logdir runs/
        Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
        TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit)
        ...
    
  
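As an aside, TensorBoard can also be launched programmatically from Python (a sketch using the tensorboard.program API; the logdir path matches the one above):

    from tensorboard import program

    tb = program.TensorBoard()
    tb.configure(argv=[None, '--logdir', '/mnt/Vancouver/apps/tensorflow/runs'])
    url = tb.launch()          ## e.g. 'http://localhost:6006/'
    print('TensorBoard at', url)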

Obtain Flair embeddings for test sentence

    
    from flair.data import Sentence
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings
    from flair.models import SequenceTagger

    sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')

    tagger = SequenceTagger.load('ner')
    tagger.predict(sentence)

    embeddings_f = FlairEmbeddings('pubmed-forward')
    embeddings_b = FlairEmbeddings('pubmed-backward')

    stacked_embeddings = StackedEmbeddings([
        embeddings_f,
        embeddings_b,
    ])

    stacked_embeddings.embed(sentence)

    tokens = [str(token).split()[2] for token in sentence]
    print(tokens)
    '''
      ['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central', 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to', 'the', 'nucleus.']
    '''
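
    ## Aside: Flair's Token objects also expose a .text attribute, so the
    ## str(token).split()[2] trick above can be written more directly
    ## (same output):
    tokens = [token.text for token in sentence]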

    for token in sentence:
        print(token)
        print(token.embedding)
        print(token.embedding.shape)

    '''
      Token: 1 The
      tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028])
      torch.Size([2300])
      Token: 2 RAS-MAPK
      tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042])
      torch.Size([2300])
      Token: 3 signalling
      tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02,
              -9.4445e-05,  1.0025e-02])
      torch.Size([2300])
      Token: 4 cascade
      tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274])
      torch.Size([2300])
      Token: 5 serves
      tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004])
      torch.Size([2300])
      Token: 6 as
      tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03,
              -4.4556e-04,  5.6909e-05])
      torch.Size([2300])
      Token: 7 a
      tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006])
      torch.Size([2300])
      Token: 8 central
      tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012])
      torch.Size([2300])
      Token: 9 node
      tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02,
              2.3646e-04,  1.0505e-02])
      torch.Size([2300])
      Token: 10 in
      tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016])
      torch.Size([2300])
      Token: 11 transducing
      tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005])
      torch.Size([2300])
      Token: 12 signals
      tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072])
      torch.Size([2300])
      Token: 13 from
      tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056])
      torch.Size([2300])
      Token: 14 membrane
      tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016])
      torch.Size([2300])
      Token: 15 receptors
      tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04,
              -1.4646e-04,  6.6120e-03])
      torch.Size([2300])
      Token: 16 to
      tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102])
      torch.Size([2300])
      Token: 17 the
      tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069])
      torch.Size([2300])
      Token: 18 nucleus.
      tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])
      torch.Size([2300])
    '''

    ## The embeddings above are PyTorch tensors (Flair depends on Torch/PyTorch).

    ## https://stackoverflow.com/questions/53903373/convert-pytorch-tensor-to-python-list
    ## https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist

    ## https://stackoverflow.com/questions/29895602/how-to-save-output-from-python-like-tsv
    ## https://stackoverflow.com/a/29896136/1904943
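
    ## Aside (added sketch): since each token.embedding is already a
    ## torch.Tensor, the tokens can also be converted straight to a 2-D
    ## NumPy array, skipping the list-of-lists detour used further below:
    import numpy as np
    embeddings_array = np.stack([token.embedding.detach().cpu().numpy()
                                 for token in sentence])
    ## embeddings_array.shape -> (18, 2300): one row per token.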
    
  

[optional] Write Python output to files

In an earlier iteration of this effort I saved the Flair tokens as metadata, and the embeddings (tensors) as a list. While those files are not needed here, I leave this code for future reference.

    
      import csv

      metadata_f = 'metadata.tsv'
      tensors_f = 'tensors.tsv'

      with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
          tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
          for token in tokens:
              ## Assign to a dummy variable ( _ ) to suppress the csv writer's
              ## return value (a character count) when run interactively.
              ## Note: writerow(token) would treat the string as a character
              ## sequence, writing each character as a separate tab-delimited
              ## field; writerow([token]) writes the token as a single field:
              _ = tsv_writer.writerow([token])


      '''
      [victoria@victoria tensorflow]$ cat metadata.tsv
        The
        RAS-MAPK
        signalling
        cascade
        serves
        as
        a
        central
        node
        in
        transducing
        signals
        from
        membrane
        receptors
        to
        the
        nucleus.
      '''

      ## token.embedding is already a PyTorch tensor; its tolist() method
      ## (used below) converts it to a plain Python list -- no extra import needed.

      with open(tensors_f, 'w', encoding='utf8', newline='') as tsv_file:
          tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
          for token in sentence:
              embedding = token.embedding
              ## https://stackoverflow.com/questions/12770213/writerow-csv-returns-a-number-instead-of-writing-rows
              ## assign to a dummy variable ( _ ) to suppress character counts
              ## tolist() is a PyTorch method that converts tensors to lists:
              _ = tsv_writer.writerow(embedding.tolist())

      ## CAUTION: even for the single, short sentence used in this example, the
      ## following `cat` statement generates an ENORMOUS list!

      '''
        [victoria@victoria tensorflow]$ cat tensors.tsv 
            0.007691788021475077	-0.02268664352595806	-0.0004340760060586035	...
      '''
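
      ## Aside: these two files (tensors.tsv + metadata.tsv) match the format
      ## expected by the standalone Embedding Projector at
      ## https://projector.tensorflow.org/, which can load a vectors TSV and a
      ## metadata TSV directly in the browser -- no local TensorBoard needed.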
    
  

Transform Flair tokens and tensors to NumPy arrays

    
      ##  https://stackoverflow.com/questions/40849116/how-to-use-tensorboard-embedding-projector/41177133
      ##  https://stackoverflow.com/a/41177133/1904943

      [victoria@victoria tensorflow]$ p37
        [Python 3.7 venv (source ~/venv/py3.7/bin/activate)]
    

    
      (py3.7) [victoria@victoria tensorflow]$ python
        Python 3.7.4 (default, Nov 20 2019, 11:36:53) 
        [GCC 9.2.0] on linux
        Type "help", "copyright", "credits" or "license" for more information.


      ## TEST:

      >>> import numpy as np
      >>> from torch.utils.tensorboard import SummaryWriter

      >>> vectors = np.array([[0,0,1], [0,1,0], [1,0,0], [1,1,1]])
      >>> metadata = ['001', '010', '100', '111']  # labels

      >>> print(metadata)
        ['001', '010', '100', '111']

      >>> print(vectors)
        [[0 0 1]
        [0 1 0]
        [1 0 0]
        [1 1 1]]

      >>> writer = SummaryWriter()
      >>> writer.add_embedding(vectors, metadata)
      >>> writer.close()
      >>>

      ## That (Nov 28, 2019: ~11:08 am) generated a new run, "Nov28_11-08-09_victoria",
      ## visible in the TensorFlow TensorBoard.  When I clicked that link, those data
      ## opened in the Projector!
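
      ## (Aside: with no log_dir argument, SummaryWriter() writes to
      ## ./runs/<DATETIME>_<hostname> by default -- which is why these runs
      ## appear under the runs/ directory served by `tensorboard --logdir runs/`.)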

      # ----------------------------------------------------------------------------

      ## Back to the Flair data.  (`sentence` below is the Flair Sentence
      ## embedded in the earlier section; re-run that code in this Python
      ## session, if needed, so that `sentence` is defined.)

      >>> tokens = [str(token).split()[2] for token in sentence]

      >>> print(tokens)
      '''
        ['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central',
        'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors',
        'to', 'the', 'nucleus.']
      '''

      >>> tokens_array = np.array(tokens)

      >>> print(tokens_array)
      '''
        ['The' 'RAS-MAPK' 'signalling' 'cascade' 'serves' 'as' 'a' 'central'
        'node' 'in' 'transducing' 'signals' 'from' 'membrane' 'receptors'
        'to' 'the' 'nucleus.']
      '''

      >>> for token in tokens_array:
      ...     print(token)
      ... 
      '''
        The
        RAS-MAPK
        signalling
        cascade
        serves
        as
        a
        central
        node
        in
        transducing
        signals
        from
        membrane
        receptors
        to
        the
        nucleus.
      '''

      >>> embeddings = [token.embedding for token in sentence]

      >>> print(embeddings)
      '''
        [tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
        tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042]),
        tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02, -9.4445e-05,  1.0025e-02]),
        tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274]),
        tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004]),
        tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03, -4.4556e-04,  5.6909e-05]),
        tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006]),
        tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012]),
        tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02, 2.3646e-04,  1.0505e-02]),
        tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016]),
        tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005]),
        tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072]),
        tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056]),
        tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016]),
        tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04, -1.4646e-04,  6.6120e-03]),
        tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102]),
        tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069]),
        tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]
      '''

      ## token.embedding is already a torch.Tensor; tolist() needs no extra import.

      >>> embeddings = [token.embedding.tolist() for token in sentence]

      ##  ***  CAUTION -- EVEN FOR THIS ONE SENTENCE THIS IS AN ENORMOUS LIST!!  ***

      >>> print(embeddings)
      '''
        [[0.007691788021475077, -0.02268664352595806, ..., -0.0004157265357207507, 0.014170931652188301]]
      '''

      >>> embeddings_array = np.array(embeddings)

      >>> print(embeddings_array)
      '''
        [[ 7.69178802e-03 -2.26866435e-02 -4.34076006e-04 ...  1.37687057e-01 -3.07319278e-04  2.84141395e-03]
        [-7.38183910e-04 -1.60104632e-01 -2.73584425e-02 ...  1.98223457e-01 1.31987268e-03  4.19976842e-03]
        [ 4.25336510e-03 -3.10180396e-01 -3.96601588e-01 ...  5.93362860e-02 -9.44453641e-05  1.00254947e-02]
        ...
        [ 3.82626243e-03 -3.53914015e-02 -1.33689731e-01 ...  5.97812422e-03 -3.52837233e-04  1.01681864e-02]
        [ 1.86223574e-02 -1.51006011e-02 -6.41461909e-02 ...  1.87926367e-02 3.90900113e-02  6.87920302e-03]
        [ 2.52505066e-04 -4.60800231e-02  4.34845686e-03 ... -1.26084751e-02 -4.15726536e-04  1.41709317e-02]]
      '''
      >>> 
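      ## Added sanity check -- expect 18 tokens, each with a 2300-dimensional
      ## stacked (pubmed-forward + pubmed-backward) embedding:
      >>> embeddings_array.shape
        (18, 2300)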
    
  

Start new TensorBoard instance & load those data

OK, we now have everything needed to visualize those tensors (reformatted as NumPy arrays) in TensorFlow’s Embedding Projector! :-D

    
      >>> from torch.utils.tensorboard import SummaryWriter
      >>> writer = SummaryWriter()

      ## Load those data:

      >>> writer.add_embedding(embeddings_array, tokens_array)
      >>> writer.close()
      >>> 

      ## Wait a few seconds for TensorBoard (http://localhost:6006/#projector)
      ## to refresh in Firefox (manually reload the browser, if needed).
      ## The new "run" appears -- "Nov28_11-54-28_victoria":
      ##    /mnt/Vancouver/apps/tensorflow/runs/Nov28_11-54-28_victoria

      ## Yea: works!! :-D
      ## SimpleScreenRecorder video screen capture below.  :-)
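
      ## One optional refinement (a sketch -- the tag value is illustrative):
      ## add_embedding() also accepts `tag` and `global_step` arguments, so
      ## successive embedding sets can be named within a single run:

      >>> writer = SummaryWriter()
      >>> writer.add_embedding(embeddings_array, metadata=tokens_array.tolist(),
      ...                      tag='flair_pubmed_stacked', global_step=0)
      >>> writer.close()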
    
  

TensorBoard Projector

[SimpleScreenRecorder video capture: Flair embeddings visualized in the TensorBoard Embedding Projector]