Flarenet/SILT 2009 meeting at Brandeis

Flarenet SILT 2009


  • Language resources are expensive and rare
  • Need to be able to discover existing resources

Current problems:

  • no union catalog
  • Inconsistent metadata (both within and between catalogs)
  • Metadata formats may be unfamiliar/unintuitive
  • To what extent can we use modern search/data discovery techniques to help?

Data categories & their semantics


  • semantics of categories
  • structured categories?
  • mapping categories
  • collections of categories (profiles?)

Data categories are defined in a registry. A subset of the categories is standardized, and other data categories can be mapped to the standard ones. The standard set of data categories ("ISOCAT") is used as a pivot (interlingua) to convert between resources that use different data categories.
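
The pivot idea can be sketched roughly as follows. This is only an illustration: the two tag sets and the `pivot:` names are invented placeholders, not real ISOCAT entries, and the sketch assumes each local tag maps to exactly one pivot category.

```python
# Hypothetical mappings from two local tag sets to shared pivot categories.
# The "pivot:..." names are placeholders, not real registry identifiers.
TAGSET_A_TO_PIVOT = {"NN": "pivot:noun", "VB": "pivot:verb", "JJ": "pivot:adjective"}
TAGSET_B_TO_PIVOT = {"N": "pivot:noun", "V": "pivot:verb", "ADJ": "pivot:adjective"}


def invert(mapping):
    """Build a pivot -> local mapping; assumes the mapping is one-to-one."""
    return {pivot: local for local, pivot in mapping.items()}


def convert(tag, source_to_pivot, target_to_pivot):
    """Convert a tag between two tag sets by going through the pivot.

    Returns None when either side has no mapping for the category.
    """
    pivot = source_to_pivot.get(tag)
    if pivot is None:
        return None
    return invert(target_to_pivot).get(pivot)


converted = [convert(t, TAGSET_A_TO_PIVOT, TAGSET_B_TO_PIVOT) for t in ["NN", "VB", "XX"]]
print(converted)
```

With a pivot, each tag set needs one mapping to the standard categories instead of a separate mapping to every other tag set.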

Data specifies:

  • form of data
  • interpretation of elements.

Tools specify:

  • form of input/output
  • interpretation of data

Interoperability: between tools/data, data/data, and tool/tool.


Each data category gets a persistent identifier. You can supply definitions & examples, in multiple languages. You can restrict the definition of some categories to specific languages. Data categories can have associated value domains (enumerated lists of values).
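
A registry entry along these lines might be modelled as below. This is a sketch under assumptions: the class name, field names, and identifier format are all made up for illustration, and the language restriction is read here as "the category only applies to certain object languages", which is one possible interpretation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DataCategory:
    pid: str  # persistent identifier (the format used below is invented)
    definitions: Dict[str, str] = field(default_factory=dict)  # language code -> definition
    examples: Dict[str, str] = field(default_factory=dict)     # language code -> example
    applies_to: Optional[Set[str]] = None    # None means no language restriction
    value_domain: Optional[Set[str]] = None  # None means an open value domain

    def accepts(self, value: str, object_language: str) -> bool:
        """Check a value against the language restriction and the value domain."""
        if self.applies_to is not None and object_language not in self.applies_to:
            return False
        return self.value_domain is None or value in self.value_domain


gender = DataCategory(
    pid="example:DC-0001",  # hypothetical identifier
    definitions={"en": "grammatical gender", "fr": "genre grammatical"},
    applies_to={"de", "ru"},
    value_domain={"masculine", "feminine", "neuter"},
)
print(gender.accepts("feminine", "de"))
```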

Goal of isocat:

  • define widely accepted linguistic concepts
  • build standard categories
  • user-driven (wiki-ish)

Thematic domain groups:

  • group of experts that select and maintain data categories that are relevant to a specific thematic domain.

Specific issues

  • What is the interpretation of an ISOCAT data category?
  • How do we map between them?

Data Publication

What requirements on the publication of a data resource maximize the potential usefulness of that resource?

What makes a resource interoperable?

  • Representation format (syntax)
  • Semantics
  • Pragmatic interoperability?



  • software/data
  • software/software
  • data/data

Simple world: harmony the whole world round

  • universal data formats
    • stand-off, plentiful metadata, etc.
  • universal data semantics
    • type system, with links to ontology/external docs (isocat, etc)
      • divide the notion of what types of annotation there are from the notion of how they're related. leave open the possibility of having multiple versions of the same data, etc.

Making tools work together: semantic differences

  • tools should specify what info they need.
  • specify a "view" of the data
    • simplest implementation: attribute mapping
  • preserve data: pass through
    • don't touch things you don't care about
    • persistent identifiers on all annotation bits?
  • merging data
    • normalize out redundant data
    • what to do with conflicting data?
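
The "view" and "pass through" points above could look something like this in the simplest attribute-mapping case. The function names and the attribute names in the example are hypothetical; annotations are assumed to be flat dictionaries.

```python
def to_view(annotation, attribute_map):
    """Rename the attributes a tool cares about; set everything else aside."""
    view = {}
    passthrough = {}
    for key, value in annotation.items():
        if key in attribute_map:
            view[attribute_map[key]] = value
        else:
            passthrough[key] = value  # untouched, reattached later
    return view, passthrough


def from_view(view, passthrough, attribute_map):
    """Map the tool's output back and reattach the untouched attributes."""
    reverse = {tool_name: shared_name for shared_name, tool_name in attribute_map.items()}
    restored = dict(passthrough)
    for key, value in view.items():
        # Attributes the tool added have no reverse mapping and keep their name.
        restored[reverse.get(key, key)] = value
    return restored


annotation = {"id": "t1", "pos": "NN", "lemma": "dog", "speaker": "A"}
attribute_map = {"pos": "tag", "lemma": "base"}  # shared name -> tool's name

view, passthrough = to_view(annotation, attribute_map)
view["chunk"] = "B-NP"  # stands in for whatever the tool adds
print(from_view(view, passthrough, attribute_map))
```

The identifier and the `speaker` attribute never reach the tool, so it cannot damage them, and they come back unchanged on the other side.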

Making tools work together: syntactic differences

  • annotation mapping tool: bidirectional mappings to many formats
    • use a small number (1?) of pivot languages
    • we need something like a "view" here too -- what maps to what?
  • automated annotation alignment
    • how does the output of a system relate to its input?
  • annotation merging
    • much easier if we can align annotations
    • what's redundant?
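
Once annotations are aligned, merging becomes mostly bookkeeping. A minimal sketch, assuming alignment has already been done and each tool's output is a dictionary keyed by `(start, end)` character offsets (an assumption made here, not something the notes specify). Conflicts are only reported, since how to resolve them is an open question.

```python
def merge_by_span(first, second):
    """Merge two annotation sets keyed by (start, end) offsets.

    Redundant attributes (same key, same value) are stored once.
    On a conflict the value from `first` is kept and the clash is recorded.
    """
    merged, conflicts = {}, []
    for span in sorted(set(first) | set(second)):
        attrs = dict(first.get(span, {}))
        for key, value in second.get(span, {}).items():
            if key in attrs and attrs[key] != value:
                conflicts.append((span, key, attrs[key], value))
            else:
                attrs[key] = value
        merged[span] = attrs
    return merged, conflicts


first = {(0, 3): {"pos": "DT"}, (4, 7): {"pos": "NN"}}
second = {(0, 3): {"pos": "DT", "lemma": "the"}, (4, 7): {"pos": "VB"}, (8, 12): {"pos": "VBZ"}}

merged, conflicts = merge_by_span(first, second)
print(merged)
print(conflicts)
```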

Linking different annotation types

  • Implicit links between annotations (e.g., two annotations spanning the same bit of text)
  • Programs whose job is to link up different annotation types
    • Resolving conflicts
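
Implicit links can be recovered from offsets alone. The sketch below links an annotation in one layer to every annotation in another layer whose span falls inside it; identical spans are the special case mentioned above, and containment is an extension assumed here for illustration. Layer contents and identifiers are invented.

```python
def implicit_links(layer_a, layer_b):
    """Pair annotations where the span from layer_b lies within the span from layer_a.

    Each layer maps an annotation id to a (start, end) character span.
    """
    links = []
    for a_id, (a_start, a_end) in layer_a.items():
        for b_id, (b_start, b_end) in layer_b.items():
            if a_start <= b_start and b_end <= a_end:
                links.append((a_id, b_id))
    return links


text = "See you next Monday"
tokens = {"t1": (0, 3), "t2": (4, 7), "t3": (8, 12), "t4": (13, 19)}
timexes = {"tx1": (8, 19)}  # "next Monday"

print(implicit_links(timexes, tokens))
```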

Use case

Record data with tool T1; transcribe it with tool T2; tokenize the transcriptions with tool T3; annotate temporal expressions with tool T4; parse the tokenized text with tool T5 (first doing POS tagging with tool T6); build a supervised model to predict temporal expressions with tool T7; and then use that model to annotate new texts.

short term:

  • how can we make tools more discoverable?
    • finding what tools exist
    • finding information about the usability of those tools
      • licensing
      • user reviews
      • input and output interfaces
      • etc
  • how can we make existing tools work together and process a wide variety of data sources?
  • create an ontology or specification language for the input/output formats used by tools
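
Even a very small specification of tool input and output makes it possible to check a chain like the use case mechanically. The sketch below is hypothetical: `ToolSpec`, its fields, and the annotation type names are invented, and a real specification language would need to say far more (encoding, tag sets, and so on).

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ToolSpec:
    name: str
    requires: FrozenSet[str]  # annotation types the tool needs as input
    provides: FrozenSet[str]  # annotation types the tool adds


def check_pipeline(tools, initial):
    """Return (tool name, missing types) for every step whose input is not satisfied."""
    available = set(initial)
    problems = []
    for tool in tools:
        missing = tool.requires - available
        if missing:
            problems.append((tool.name, sorted(missing)))
        available |= tool.provides  # tools add information; nothing is removed
    return problems


# Three of the tools from the use case, with made-up type names.
t3 = ToolSpec("T3", frozenset({"text"}), frozenset({"tokens"}))
t6 = ToolSpec("T6", frozenset({"tokens"}), frozenset({"pos"}))
t5 = ToolSpec("T5", frozenset({"tokens", "pos"}), frozenset({"parse"}))

print(check_pipeline([t3, t6, t5], {"text"}))  # tagging runs before parsing
print(check_pipeline([t3, t5], {"text"}))      # the tagger was left out
```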

long term:

  • what is the best methodology for creating tools that work together?


  • tool providers
  • interchange format providers [integration platform providers]
  • users
  • flarenet/silt

What we need:

  • Specification language for input/output formats [silt]
    • formal language for specifying what the input & output of a tool is.
    • or registry of formats?
    • how does this relate to standard data categories?
  • Specify API and/or the input and output of each tool [tool provider, or someone else]
    • syntax: how are things encoded
    • semantics: what does each piece mean? Be as specific as possible: not just pos tag/msd, but specify which tag set, etc. But no more specific than the tool is -- e.g., some tools can work with different tag sets.
    • how does the output relate to the input? (how) can we align them?
  • Adapters [Interchange format providers]
    • Adapters that convert the interchange language to various tool formats
    • Adapters that convert tool formats to the interchange language
      • This is non-trivial. E.g., alignment, merging, eliminating duplication.
    • Conversion to/from other interchange formats
  • Interchange formats
    • Need to be flexible and extensible
    • Need to have something like DTDs
      • There should be recommended/standard types. Don't just leave it entirely open.
      • Ideally, it should be possible to map between elements and some ontology like isocat or gold
  • Guidelines for new tools (for interoperability)
    • alignment
      • basic level: it is possible to align the output with the input
      • better: it is trivial to align. e.g., preserve identifiers for pieces
      • best: pass through information.
    • monotonically increasing: typically tools should add, not modify
    • should specify the input/output format
  • Support for "views" of data
    • Could be done within tools, or using adapters
      • If using adapters, you may need to 'move some attributes out of the way'
  • Wrappers
    • binary wrappers (run existing tools as subprocesses)
    • web services wrappers
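
A binary wrapper can be very thin. The sketch below uses Python's standard `subprocess.run`; the `run_tool` helper is hypothetical, it assumes the tool reads text on stdin and writes to stdout, and a one-line Python script stands in for a real tokenizer binary so the example is self-contained.

```python
import subprocess
import sys


def run_tool(command, text, timeout=30):
    """Run an external tool, feeding `text` on stdin and returning its stdout."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Stand-in for a real tokenizer binary: prints each whitespace-separated
# token on its own line.
FAKE_TOKENIZER = [
    sys.executable,
    "-c",
    "import sys; print('\\n'.join(sys.stdin.read().split()))",
]

tokens = run_tool(FAKE_TOKENIZER, "record then transcribe").splitlines()
print(tokens)
```

An adapter to and from the interchange format would sit on either side of `run_tool`, so the wrapped tool itself does not need to change.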