The new version of SpaCy in pre-release has several new features I'm eagerly awaiting.

  • Transformer Model
  • Trainable Sentence Splitter
  • Improved Interface for Training
  • Groups of Overlapping Spans (sketched just below)
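
Span groups are the one feature I won't get to below, so here's a quick toy sketch of the idea: a new doc.spans container that, unlike doc.ents, is allowed to hold overlapping spans. The text and labels here are my own invention:

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China")

# Unlike doc.ents, a span group may contain overlapping spans
doc.spans["orgs"] = [Span(doc, 3, 6, label="ORG"), Span(doc, 5, 6, label="GPE")]
[(span.text, span.label_) for span in doc.spans["orgs"]]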
import spacy

from spacy.tokens import Doc
from spacy.training import Example

spacy.__version__
'3.0.0rc3'

Models

There is now a transformer-based model for greater accuracy.

model_acc = spacy.load("en_core_web_trf")
model_acc.components
[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x1be4e4513b0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1be4e457e50>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1bd9286b040>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1bdb7373ee0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1be0a08a600>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1be0a091900>)]
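
One nice detail: the transformer component stores its raw output on the doc via a custom attribute, so the tagger, parser, and NER all share a single forward pass. A quick sketch (the shape in the comment is illustrative, not captured output):

doc = model_acc("Transformers now power the accuracy-focused pipelines.")

# spacy-transformers exposes the raw model output on the doc
doc._.trf_data.tensors[0].shape  # wordpiece embeddings, e.g. (1, n_wordpieces, 768)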

There is also an improved version of the efficient model we all know and love.

model_eff = spacy.load("en_core_web_sm")
model_eff.components
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1be4ba76860>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1be0a1c94a0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1bdb5b896a0>),
 ('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x1be4ba76b80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1bdb5cb7fa0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1bdb5f156c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1bdb5f6b340>)]
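
Note the senter in that list: as I understand the packaged pipelines, it ships disabled and sentence boundaries come from the parser by default. If all you need is sentence splitting, something like this should give a much faster pipeline (a sketch, assuming the default config):

nlp_fast = spacy.load("en_core_web_sm", exclude=["parser"])
nlp_fast.enable_pipe("senter")

doc = nlp_fast("This is one sentence. This is another.")
[sent.text for sent in doc.sents]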

Example Class

I used to dread getting everything formatted into GoldParse objects. The new Example class provides a uniform, simple way of formatting data for training.

predicted = Doc(model_eff.vocab, words=["Apply", "some", "sun", "screen"])

token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
sent_refs = [1, 0, 0, 0]

example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref, "sent_starts": sent_refs})
example.to_dict()
{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['Apply', 'some', 'sun', 'screen'],
  'SPACY': [True, True, True, True],
  'TAG': ['VERB', 'DET', 'NOUN', 'NOUN'],
  'LEMMA': ['', '', '', ''],
  'POS': ['', '', '', ''],
  'MORPH': ['', '', '', ''],
  'HEAD': [0, 1, 2, 3],
  'DEP': ['', '', '', ''],
  'SENT_START': [1, 0, 1, 0]}}
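
Example also copes with the predicted tokenization disagreeing with the reference one, which was the really painful part of the old GoldParse workflow. A sketch with a hypothetical merged token:

# Hypothetical mismatch: suppose the model tokenized "sunscreen" as one token
pred_merged = Doc(model_eff.vocab, words=["Apply", "some", "sunscreen"])
misaligned = Example.from_dict(pred_merged, {"words": token_ref, "tags": tags_ref})

# Reference tags projected onto the predicted tokens
misaligned.get_aligned("TAG", as_string=True)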

A Statistical Sentence Splitter

This has been one of my dreams for quite a while; I mentioned it a few times on the SpaCy forums too. Very glad to see it make its way into the library now.

senter = model_eff.get_pipe("senter")
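# initialize() sets up the pipeline for training (resetting trained weights) and returns an optimizer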
optimizer = model_eff.initialize()

examples = [example]
losses = senter.update(examples, sgd=optimizer)
losses
{'senter': 1.9999818801879883}
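
One update on one example is just a smoke test, of course. Here's a toy sketch of pushing it a bit further and then applying only the senter to a fresh doc (output not shown, and real training should go through the new config-driven spacy train workflow):

# Toy loop; real training batches and shuffles many examples
for _ in range(10):
    losses = senter.update([example], sgd=optimizer)

# Run just the senter over a fresh doc and inspect the predicted boundaries
doc = senter(model_eff.make_doc("Apply some sun screen. Reapply it often."))
[token.is_sent_start for token in doc]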