Flarenet SILT 2009 - Notes on Papers

Flarenet/SILT 2009 meeting at Brandeis

Metadata

Language resources are expensive and rare
Need to be able to discover existing resources

Current problems:

no union catalog
Inconsistent metadata (both within and between catalogs)
Metadata formats may be unfamiliar/unintuitive
To what extent can we use modern search/data discovery techniques to help?

Data categories & their semantics

Problems:

semantics of categories
structured categories?
mapping categories
collections of categories (profiles?)

Data categories are defined in a registry. A subset of the categories is standardized. Other data categories can be mapped to the standard ones. The standard set of data categories ("ISOCAT") is used as a pivot (interlingua) to convert between resources that use different data categories.
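
A minimal sketch of the pivot idea, assuming each resource supplies a one-to-one mapping from its own tags to shared pivot identifiers. The tag sets and the "DC-..." identifiers are placeholders invented for illustration, not real ISOcat entries.

```python
# Hypothetical mappings from two resources' local tags to pivot identifiers.
# The "DC-..." names are placeholders, not real registry entries.
TO_PIVOT_A = {"NN": "DC-noun", "VB": "DC-verb", "JJ": "DC-adjective"}
TO_PIVOT_B = {"N": "DC-noun", "V": "DC-verb", "ADJ": "DC-adjective"}


def invert(mapping):
    """Build the pivot-to-local direction of a one-to-one mapping."""
    inverted = {}
    for local, pivot in mapping.items():
        if pivot in inverted:
            raise ValueError(f"pivot category {pivot!r} is mapped more than once")
        inverted[pivot] = local
    return inverted


def convert(tag, source_to_pivot, target_to_pivot):
    """Convert a tag from the source scheme to the target scheme via the pivot."""
    pivot = source_to_pivot[tag]
    return invert(target_to_pivot)[pivot]


print(convert("NN", TO_PIVOT_A, TO_PIVOT_B))   # scheme A -> pivot -> scheme B
print(convert("ADJ", TO_PIVOT_B, TO_PIVOT_A))  # scheme B -> pivot -> scheme A
```

With a pivot, each of n schemes needs one mapping to the pivot instead of a separate mapping to every other scheme.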

Data specifies:

form of data
interpretation of elements

Tools specify:

form of input/output
interpretation of data

Interoperability: between tools/data, data/data, and tool/tool.

http://www.isocat.org

Each data category gets a persistent identifier. You can supply definitions & examples, in multiple languages. You can restrict the definition of some categories to specific languages. Data categories can have associated value domains (enumerated lists of values).
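
One way to picture such a registry entry is the sketch below. The class, its field names, and the identifier format are assumptions made for illustration; they do not reflect the actual ISOcat data model.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical registry entry; field names are illustrative only."""
    pid: str                                         # persistent identifier
    definitions: dict = field(default_factory=dict)  # language code -> definition
    examples: dict = field(default_factory=dict)     # language code -> example
    applies_to: set = field(default_factory=set)     # empty set = all languages
    value_domain: tuple = ()                         # empty tuple = open domain

    def accepts(self, value, language=None):
        """Check a value against the language restriction and value domain."""
        if self.applies_to and language not in self.applies_to:
            return False
        return not self.value_domain or value in self.value_domain


gender = DataCategory(
    pid="example:grammaticalGender",  # placeholder, not a real identifier
    definitions={"en": "grammatical gender", "fr": "genre grammatical"},
    value_domain=("masculine", "feminine", "neuter"),
)
print(gender.accepts("feminine"))
print(gender.accepts("dual"))
```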

Goal of isocat:

define widely accepted linguistic concepts
build standard categories
user-driven (wiki-ish)

Thematic domain groups:

a group of experts who select and maintain the data categories relevant to a specific thematic domain.

Specific issues

What is the interpretation of an ISOCAT data category?
How do we map between them?

Data Publication

What requirements on the publication of a data resource maximize the potential usefulness of that resource?

What makes a resource interoperable?

Representation format (syntax)
Semantics
Pragmatic interoperability?

Software

interoperability:

software/data
software/software
data/data

Simple world: harmony the whole world round

universal data formats
- stand-off, plentiful metadata, etc.
universal data semantics
- type system, with links to ontology/external docs (isocat, etc)
  - separate the notion of what types of annotation there are from the notion of how they're related. Leave open the possibility of having multiple versions of the same data, etc.

Making tools work together: semantic differences

tools should specify what info they need.
specify a "view" of the data
- simplest implementation: attribute mapping
preserve data: pass through
- don't touch things you don't care about
- persistent identifiers on all annotation bits?
merging data
- normalize out redundant data
- what to do with conflicting data?
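
The "view" and "pass through" points above could look roughly like this, assuming annotations are plain dictionaries and a view is a mapping from our attribute names to the names a tool expects. Both function names are made up for this sketch.

```python
def apply_view(annotation, view):
    """Rename the attributes a tool needs; set everything else aside untouched."""
    needed = {view[k]: v for k, v in annotation.items() if k in view}
    passthrough = {k: v for k, v in annotation.items() if k not in view}
    return needed, passthrough


def restore(tool_output, passthrough, view):
    """Map the tool's attribute names back and re-attach the passed-through data."""
    reverse = {tool_name: name for name, tool_name in view.items()}
    restored = {reverse.get(k, k): v for k, v in tool_output.items()}
    return {**passthrough, **restored}


annotation = {"id": "t1", "pos": "NN", "lemma": "dog", "speaker": "A"}
view = {"pos": "tag"}  # the (hypothetical) tool calls our "pos" attribute "tag"

needed, passthrough = apply_view(annotation, view)
tool_output = {**needed, "chunk": "B-NP"}  # stand-in for what the tool adds
print(restore(tool_output, passthrough, view))
```

The tool only ever sees the attributes named in the view, so attributes it does not care about (here the identifier, lemma, and speaker) survive unchanged.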

Making tools work together: syntactic differences

annotation mapping tool: bidirectional mappings to many formats
- use a small number (1?) of pivot languages
- we need something like a "view" here too -- what maps to what?
automated annotation alignment
- how does the output of a system relate to its input?
annotation merging
- much easier if we can align annotations
- what's redundant?
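
A rough sketch of merging once annotations are aligned, assuming stand-off annotations stored as dictionaries with character offsets and alignment by identical (start, end) span. Keeping the first value and reporting the clash is just one possible conflict policy.

```python
def merge(first, second):
    """Merge two annotation lists aligned by span.

    Identical attribute values are stored once (redundancy removed);
    differing values keep the first one and are reported as conflicts.
    """
    merged = {}
    conflicts = []
    for ann in first + second:
        key = (ann["start"], ann["end"])
        target = merged.setdefault(key, {})
        for attr, value in ann.items():
            if attr not in target:
                target[attr] = value
            elif target[attr] != value:
                conflicts.append((key, attr, target[attr], value))
    return [merged[k] for k in sorted(merged)], conflicts


a = [{"start": 0, "end": 3, "pos": "DT"}, {"start": 4, "end": 7, "pos": "NN"}]
b = [{"start": 0, "end": 3, "pos": "DT", "lemma": "the"},
     {"start": 4, "end": 7, "pos": "VB"}]

merged, conflicts = merge(a, b)
print(merged)
print(conflicts)
```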

Linking different annotation types

Implicit links between annotations (e.g., two annotations spanning the same bit of text)
Programs whose job is to link up different annotation types
- Resolving conflicts
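
Implicit links of the kind described above can be made explicit with a small linking program. The sketch below assumes each annotation layer is a list of dictionaries with an id and character offsets, and only links annotations whose spans match exactly; the layer names are invented.

```python
from collections import defaultdict


def link_by_span(layers):
    """Group annotations from different layers that cover exactly the same span."""
    by_span = defaultdict(list)
    for layer_name, annotations in layers.items():
        for ann in annotations:
            by_span[(ann["start"], ann["end"])].append((layer_name, ann["id"]))
    # Keep only spans shared by at least two different layers.
    return {
        span: members
        for span, members in by_span.items()
        if len({layer for layer, _ in members}) > 1
    }


layers = {
    "token": [{"id": "t1", "start": 0, "end": 4}, {"id": "t2", "start": 5, "end": 9}],
    "timex": [{"id": "x1", "start": 5, "end": 9}],
}
print(link_by_span(layers))
```

Overlapping or nested spans would need a looser matching rule, which is where conflict resolution comes in.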

Use case

record data with tool T1; transcribe it with tool T2; tokenize transcriptions with tool T3; annotate temporal expressions with tool T4; parse the tokenized text with tool T5 (first doing pos tagging with tool T6); build a supervised model to predict temporal expressions with tool T7; and then use it to annotate new texts.

short term:

how can we make tools more discoverable?
- finding what tools exist
- finding information about the usability of those tools
  - licensing
  - user reviews
  - input and output interfaces
  - etc
how can we make existing tools work together and process a wide variety of data sources?
create an ontology or specification language for the input/output formats used by tools
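
As a very small stand-in for such a specification language, the sketch below records which annotation types each tool requires and provides, then checks whether a chain of tools fits together. The ToolSpec class and the type names are assumptions; T3, T5, and T6 refer to the tokenizer, parser, and POS tagger from the use case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool description for a registry of tools."""
    name: str
    requires: frozenset  # annotation types the tool needs in its input
    provides: frozenset  # annotation types the tool adds to its output
    license: str = "unknown"


def check_pipeline(tools, available=frozenset({"text"})):
    """Return (tool name, missing types) for every unmet input requirement."""
    have = set(available)
    problems = []
    for tool in tools:
        missing = tool.requires - have
        if missing:
            problems.append((tool.name, sorted(missing)))
        have |= tool.provides
    return problems


T3 = ToolSpec("T3", frozenset({"text"}), frozenset({"token"}))
T6 = ToolSpec("T6", frozenset({"token"}), frozenset({"pos"}))
T5 = ToolSpec("T5", frozenset({"token", "pos"}), frozenset({"parse"}))

print(check_pipeline([T3, T6, T5]))  # tagging before parsing
print(check_pipeline([T3, T5]))      # parser run without the tagger
```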

long term:

what is the best methodology for creating tools that work together?

Stakeholders:

tool providers
interchange format providers [integration platform providers]
users
flarenet/silt

What we need:

Specification language for input/output formats [silt]
- formal language for specifying what the input & output of a tool is.
- or registry of formats?
- how does this relate to standard data categories?
Specify API and/or the input and output of each tool [tool provider, or someone else]
- syntax: how are things encoded
- semantics: what does each piece mean? Be as specific as possible: not just pos tag/msd, but specify which tag set, etc. But no more specific than the tool is, e.g., some tools can work with different tag sets.
- how does the output relate to the input? (how) can we align them?
Adapters [Interchange format providers]
- Adapters that convert the interchange language to various tool formats
- Adapters that convert tool formats to the interchange language
  - This is non-trivial. E.g., alignment, merging, eliminating duplication.
- Conversion to/from other interchange formats
Interchange formats
- Need to be flexible and extensible
- Need to have something like DTDs
  - There should be recommended/standard types. Don't just leave it entirely open.
  - Ideally, it should be possible to map between elements and some ontology like isocat or gold
Guidelines for new tools (for interoperability)
- alignment
  - basic level: it is possible to align the output with the input
  - better: it is trivial to align. e.g., preserve identifiers for pieces
  - best: pass through information.
- monotonically increasing: tools should typically add information, not modify it
- should specify the input/output format
Support for "views" of data
- Could be done within tools, or using adapters
  - If using adapters, you may need to 'move some attributes out of the way'
Wrappers
- binary wrappers (run existing tools as subprocesses)
- web services wrappers
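
A minimal sketch of a binary wrapper, assuming the wrapped tool reads a JSON document on stdin and writes the annotations it adds as JSON on stdout. That protocol, the function name, and the stand-in tokenizer are all assumptions for illustration. The wrapper also enforces the add-only guideline by rejecting output that overwrites existing keys.

```python
import json
import subprocess
import sys


def run_wrapped(command, document, timeout=30):
    """Run an external tool as a subprocess and merge in what it adds."""
    result = subprocess.run(
        command,
        input=json.dumps(document),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    added = json.loads(result.stdout)
    # Tools should add, not modify: refuse output that touches existing keys.
    clash = set(added) & set(document)
    if clash:
        raise ValueError(f"tool tried to overwrite: {sorted(clash)}")
    return {**document, **added}


# Stand-in for a real binary: a one-line Python script that splits on whitespace.
FAKE_TOKENIZER = [
    sys.executable, "-c",
    "import json, sys; doc = json.load(sys.stdin); "
    "print(json.dumps({'tokens': doc['text'].split()}))",
]

print(run_wrapped(FAKE_TOKENIZER, {"id": "d1", "text": "see you at noon"}))
```

A web service wrapper could expose the same function behind an HTTP endpoint, with the same pass-through behaviour.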