Flarenet/SILT 2009 meeting at Brandeis
Metadata
- Language resources are expensive and rare
- Need to be able to discover existing resources
Current problems:
- no union catalog
- Inconsistent metadata (both within and between catalogs)
- Metadata formats may be unfamiliar/unintuitive
- To what extent can we use modern search/data discovery techniques to help?
Data categories & their semantics
Problems:
- semantics of categories
- structured categories?
- mapping categories
- collections of categories (profiles?)
Data categories are defined in a registry. Some subset of the categories is standardized; other data categories can be mapped to the standard ones. The standard set of data categories ("ISOCAT") is used as a pivot (interlingua) to convert between resources that use different data categories.
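A minimal sketch of the pivot idea, assuming two hypothetical tag sets (`tagset_a`, `tagset_b`) and invented pivot identifiers (`pivot:noun`, `pivot:verb`). None of these are real ISOCAT identifiers, and the inversion step assumes each mapping is one-to-one, which real tag sets often are not.

```python
# Hypothetical mappings from two local tag sets to shared pivot categories.
# The "pivot:..." identifiers are invented for illustration only.
TO_PIVOT = {
    "tagset_a": {"NN": "pivot:noun", "VB": "pivot:verb"},
    "tagset_b": {"N": "pivot:noun", "V": "pivot:verb"},
}


def convert(tag, source, target):
    """Convert a tag from one tag set to another by going through the pivot."""
    pivot = TO_PIVOT[source][tag]
    # Invert the target mapping to get from the pivot back to a local tag.
    from_pivot = {p: t for t, p in TO_PIVOT[target].items()}
    return from_pivot[pivot]


print(convert("NN", "tagset_a", "tagset_b"))  # N
```

Each resource only needs a mapping to the pivot, so adding a new tag set does not require a mapping to every other tag set.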
Data specifies:
- form of data
- interpretation of elements.
Tools specify:
- form of input/output
- interpretation of data
Interoperability: between tools/data, data/data, and tool/tool.
Each data category gets a persistent identifier. You can supply definitions & examples, in multiple languages. You can restrict the definition of some categories to specific languages. Data categories can have associated value domains (enumerated lists of values).
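One way to picture such a registry entry is the sketch below. The class name `DataCategory`, its field names, and the identifier `example:dc-0001` are all assumptions made for illustration; they are not the actual ISOCAT data model.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    pid: str                                          # persistent identifier (made up here)
    definitions: dict = field(default_factory=dict)   # language code -> definition text
    examples: dict = field(default_factory=dict)      # language code -> example text
    restricted_to: set = field(default_factory=set)   # languages it applies to; empty = any
    value_domain: frozenset = frozenset()             # enumerated values; empty = open

    def accepts(self, value):
        """True if the value is allowed (an empty value domain allows anything)."""
        return not self.value_domain or value in self.value_domain


gender = DataCategory(
    pid="example:dc-0001",
    definitions={"en": "grammatical gender", "fr": "genre grammatical"},
    value_domain=frozenset({"masculine", "feminine", "neuter"}),
)
print(gender.accepts("neuter"))  # True
print(gender.accepts("plural"))  # False
```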
Goal of isocat:
- define widely accepted linguistic concepts
- build standard categories
- user-driven (wiki-ish)
Thematic domain groups:
- group of experts that selects and maintains data categories that are relevant to a specific thematic domain.
Specific issues
- What is the interpretation of an ISOCAT data category?
- How do we map between them?
Data Publication
What requirements on the publication of a data resource maximize the potential usefulness of that resource?
What makes a resource interoperable?
- Representation format (syntax)
- Semantics
- Pragmatic interoperability?
Software
interoperability:
- software/data
- software/software
- data/data
Simple world: harmony the whole world round
- universal data formats
- stand-off, plentiful metadata, etc.
- universal data semantics
- type system, with links to ontology/external docs (isocat, etc)
- divide the notion of what types of annotation there are from the notion of how they're related. leave open the possibility of having multiple versions of the same data, etc.
Making tools work together: semantic differences
- tools should specify what info they need.
- specify a "view" of the data
- simplest implementation: attribute mapping
- preserve data: pass through
- don't touch things you don't care about
- persistent identifiers on all annotation bits?
- merging data
- normalize out redundant data
- what to do with conflicting data?
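The points above can be sketched as follows, assuming annotations are plain attribute dictionaries. The helper names `apply_view` and `merge` are hypothetical. The view renames only the attributes a tool asks for and passes everything else through untouched. The merge treats identical values as redundant and reports differing values as conflicts rather than resolving them.

```python
def apply_view(annotation, mapping):
    """Rename the attributes named in `mapping`; pass all others through unchanged."""
    return {mapping.get(key, key): value for key, value in annotation.items()}


def merge(a, b):
    """Merge two attribute dicts. Equal values are redundant; differing values conflict."""
    merged, conflicts = dict(a), {}
    for key, value in b.items():
        if key in merged and merged[key] != value:
            conflicts[key] = (merged[key], value)  # keep a's value, record the clash
        else:
            merged[key] = value
    return merged, conflicts


token = {"id": "t1", "pos": "NN", "lemma": "dog"}
print(apply_view(token, {"pos": "tag"}))
# {'id': 't1', 'tag': 'NN', 'lemma': 'dog'}

merged, conflicts = merge(token, {"id": "t1", "pos": "VB", "ner": "O"})
print(merged)     # {'id': 't1', 'pos': 'NN', 'lemma': 'dog', 'ner': 'O'}
print(conflicts)  # {'pos': ('NN', 'VB')}
```

What to do with the reported conflicts is left open here, as it is in the notes.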
Making tools work together: syntactic differences
- annotation mapping tool: bidirectional mappings to many formats
- use a small number (1?) of pivot languages
- we need something like a "view" here too -- what maps to what?
- automated annotation alignment
- how does the output of a system relate to its input?
- annotation merging
- much easier if we can align annotations
- what's redundant?
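A small sketch of the pivot approach to syntactic differences, assuming two invented toy formats ("slash": `word/TAG` separated by spaces; "columns": one `word<TAB>TAG` per line) and a pivot that is simply a list of (token, tag) pairs. Each format supplies one reader and one writer, so N formats need 2N converters rather than one per pair of formats.

```python
# Reader: format -> pivot. Writer: pivot -> format.
def read_slash(text):
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def write_slash(pairs):
    return " ".join(f"{tok}/{tag}" for tok, tag in pairs)


def read_columns(text):
    return [tuple(line.split("\t")) for line in text.splitlines() if line]


def write_columns(pairs):
    return "\n".join(f"{tok}\t{tag}" for tok, tag in pairs)


FORMATS = {
    "slash": (read_slash, write_slash),
    "columns": (read_columns, write_columns),
}


def convert_format(text, source, target):
    """Convert between any two registered formats by way of the pivot."""
    reader, _ = FORMATS[source]
    _, writer = FORMATS[target]
    return writer(reader(text))


print(convert_format("the/DT dog/NN", "slash", "columns"))
```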
Linking different annotation types
- Implicit links between annotations (e.g., two annotations spanning the same bit of text)
- Programs whose job is to link up different annotation types
- Resolving conflicts
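As an illustration of implicit links, the sketch below groups annotations from different layers that cover exactly the same character span. It assumes stand-off annotations stored as dictionaries with `start` and `end` offsets; the function name `link_by_span` and the field names are hypothetical. Partial overlaps and conflict resolution are not handled.

```python
from collections import defaultdict


def link_by_span(*layers):
    """Group annotations from several layers that share the same (start, end) span."""
    by_span = defaultdict(list)
    for layer in layers:
        for ann in layer:
            by_span[(ann["start"], ann["end"])].append(ann)
    # Keep only spans that carry more than one annotation.
    return {span: anns for span, anns in by_span.items() if len(anns) > 1}


pos = [{"start": 0, "end": 4, "type": "pos", "value": "NNP"}]
ner = [
    {"start": 0, "end": 4, "type": "ner", "value": "PERSON"},
    {"start": 10, "end": 16, "type": "ner", "value": "LOC"},
]
links = link_by_span(pos, ner)
print(sorted(links))  # [(0, 4)]
```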
Use case
Record data with tool T1; transcribe it with tool T2; tokenize the transcriptions with tool T3; annotate temporal expressions with tool T4; parse the tokenized text with tool T5 (first doing POS tagging with tool T6); build a supervised model to predict temporal expressions with tool T7; and then use it to annotate new texts.
short term:
- how can we make tools more discoverable?
- finding what tools exist
- finding information about the usability of those tools
- licensing
- user reviews
- input and output interfaces
- etc
- how can we make existing tools work together and process a wide variety of data sources?
- create an ontology or specification language for the input/output formats used by tools
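A very reduced sketch of what a machine-readable input/output specification could enable, assuming each tool simply declares the annotation layers it requires and provides. The `ToolSpec` class, the layer names, and `check_pipeline` are hypothetical; a real specification language would also need to cover encoding and tag sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    requires: frozenset  # annotation layers the tool needs as input
    provides: frozenset  # annotation layers the tool adds to the data


def check_pipeline(tools, initial=frozenset({"text"})):
    """Return (tool name, missing layers) for each tool whose inputs are not available."""
    available, problems = set(initial), []
    for tool in tools:
        missing = tool.requires - available
        if missing:
            problems.append((tool.name, sorted(missing)))
        available |= tool.provides
    return problems


tokenizer = ToolSpec("tokenizer", frozenset({"text"}), frozenset({"tokens"}))
tagger = ToolSpec("pos-tagger", frozenset({"tokens"}), frozenset({"pos"}))
parser = ToolSpec("parser", frozenset({"tokens", "pos"}), frozenset({"parse"}))

print(check_pipeline([tokenizer, tagger, parser]))  # []
print(check_pipeline([tokenizer, parser]))          # [('parser', ['pos'])]
```

With declarations like these, a missing step (here, POS tagging before parsing) can be detected before anything is run.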
long term:
- what is the best methodology for creating tools that work together?
Stakeholders:
- tool providers
- interchange format providers [integration platform providers]
- users
- flarenet/silt
What we need:
- Specification language for input/output formats [silt]
- formal language for specifying what the input & output of a tool is.
- or registry of formats?
- how does this relate to standard data categories?
- Specify API and/or the input and output of each tool [tool provider, or someone else]
- syntax: how are things encoded
- semantics: what does each piece mean? Be as specific as possible: not just POS tag/MSD, but specify which tag set, etc. But no more specific than the tool is -- e.g., some tools can work with different tag sets.
- how does the output relate to the input? (how) can we align them?
- Adapters [Interchange format providers]
- Adapters that convert the interchange language to various tool formats
- Adapters that convert tool formats to the interchange language
- This is non-trivial. E.g., alignment, merging, eliminating duplication.
- Conversion to/from other interchange formats
- Interchange formats
- Need to be flexible and extensible
- Need to have something like DTDs
- There should be recommended/standard types. Don't just leave it entirely open.
- Ideally, it should be possible to map between elements and some ontology like isocat or gold
- Guidelines for new tools (for interoperability)
- alignment
- basic level: it is possible to align the output with the input
- better: it is trivial to align. e.g., preserve identifiers for pieces
- best: pass through information.
- monotonic increasing: typically tools should add, not modify
- should specify the input/output format
- Support for "views" of data
- Could be done within tools, or using adapters
- If using adapters, you may need to 'move some attributes out of the way'
- Wrappers
- binary wrappers (run existing tools as subprocesses)
- web services wrappers
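A binary wrapper can be as small as the sketch below, which assumes the wrapped tool reads text on standard input and writes its result to standard output. The function name `run_tool` is hypothetical, and the "tool" is a stand-in Python one-liner that uppercases its input, used only so the example is self-contained.

```python
import subprocess
import sys


def run_tool(command, text):
    """Run an external tool as a subprocess: send text on stdin, return its stdout."""
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in for a real tool binary: uppercases whatever it reads from stdin.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

print(run_tool(STAND_IN, "hello world"))  # HELLO WORLD
```

An adapter to and from the interchange format would sit on either side of `run_tool`; `check=True` makes a failing tool raise an error instead of returning partial output.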