| Location: | Vienna, Austria |
| --- | --- |
| Date: | March 12-13, 2009 |
| Webpage: | http://www.flarenet.eu/?q=Vienna09_Description |
| Tags: | |
Fostering Language Resources Network
Opening Session
Roberto Cencioni, Walther Lichem, Nicoletta Calzolari, Gerhard Budin
EC - DG Information Society & Media - Unit INFSO.E1 - LTs & MT, LUX / Head of Unit
Roberto Cencioni
Language Technology & MT unit: INFSO.E1
- new unit established in July 2008
- focus on multilingual technologies, services, apps
- two instruments in 2009:
- Research: FP7 ICT, call 4 (objective 2.2: language based interaction)
- Innovation: CIP ICT-PSP, call 3 (theme 5: multilingual web)
The web is a collaborative framework, but there are significant language barriers. Europe has 23+ widely used languages -- and not enough translators. The aim is a single European information space: bridge the language barriers in the information society.
Goals: support & enhance interpersonal & business communication, information access, and publishing.
- for everyone
- across languages
- emphasis on online environments
FLaReNet
Nicoletta Calzolari
FLaReNet: an international forum to facilitate interaction, and...
- overcome fragmentation
- create a shared policy for the field of language resources & technologies
- foster a European strategy for consolidating the sector and enhancing competitiveness
Language resources & technologies:
- used by many different communities
- various dimensions: technical, organizational, economic, legal, political
Promote international cooperation
- E.g., w/ US SILT initiative
Session 1: Broadening the Coverage, Addressing the Gaps
Introduction
Joseph Mariani (LIMSI/IMMI-CNRS, FR)
2 Issues:
- Language coverage
- Topic coverage
Availability of language resources of sufficient quality & quantity is needed to develop language technologies.
LRs and LTs in one language:
- Define the task
- Determine the needed LTs
- Determine the annotations of LRs, and metrics & protocols for evaluation
- Find a way to support production of LRs and evaluation (incl. financial, organizational)
- Produce LRs
- Develop and evaluate LTs
- Cycle back to improve
Try to fill in language resource matrices: one axis lists languages (incl. dialects); the other lists resources (lexicon, pronunciation lexicon, dictionary, treebank, etc.). Fill in the matrix cells with information about how much data is available.
Similarly, fill in a language technology matrix: tokenizer, named entity recognizer, chunker, POS tagger, parser, spell checker, summarizer, search engine, text-to-speech, ASR, etc. Fill in the matrix with best available performance, etc.
A multilingual matrix could show how many parallel corpora, or machine translation systems, exist for various pairs of languages. Cells could hold amount of data, system performance, etc.
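To make this concrete, here is a minimal sketch (my own illustration; the languages, resource names, and quantities are invented) of such a coverage matrix as a nested mapping, printing the gaps:

```python
# Hypothetical language-resource coverage matrix: rows are languages,
# columns are resource types, cells record roughly how much data exists.
RESOURCES = ["lexicon", "pronunciation lexicon", "treebank", "ASR corpus"]

coverage = {
    "English": {"lexicon": "500k entries", "pronunciation lexicon": "120k entries",
                "treebank": "1M words", "ASR corpus": "2000h"},
    "Welsh":   {"lexicon": "30k entries"},  # mostly empty cells: gaps to fill
}

for language, cells in coverage.items():
    for resource in RESOURCES:
        print(f"{language:8} | {resource:22} | {cells.get(resource, 'MISSING')}")
```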
Coverage & BLARKS
Steven Krauwer (Universiteit Utrecht, NL) & Khalid Choukri (ELDA, FR)
BLARK = Basic Language Resource Kit
- Defines the minimal set of LRs necessary to do any education or pre-competitive language and speech technology research for a language at all
- LR includes written, spoken, & multimodal materials, and modules and tools
- Can be used to measure the technological coverage of a language
- Can be used as an agenda for creating new resources
It's dynamic, because technology, requirements, and expectations evolve over time.
So the notion of "minimum set of resources" needs to be periodically re-evaluated
Used with a view to developing language and speech technology applications for a language.
- Entry level BLARKettes for languages with virtually no tech support, mainly aimed at training and education.
- Standard BLARK, serving education and pre-competitive research
- Extended BLARK, serving advanced research & commercial development
It's important that these collected tools be coherent and interoperable.
Build on existing initiatives:
- Try to give an authoritative definition of what BLARK (currently) contains
- Analysis per language of where we stand
- Mechanisms for maintenance
BLARK as a tool for language resource coverage assessment, road mapping, and language policy planning:
- Many LR pieces are missing.
- Some are available, but not at "fair conditions"
- What exists but is not available (traded vs non-traded)
- What exists but is not identified
- What is lost -- not archived
- What does not exist at all
What can we (language experts) do for BLARK?
- Define, specify, improve: See www.blark.org
- Help enhance the content of the universal catalogue
What can BLARK do for you?
- Help you w/ input for R&D
What can funding agencies do for BLARK?
- Make sure the info is accurate and reflects community plans/trends
- Used as a consistent planning/roadmapping inventory
- Used as an inventory of the state of the art
Universal catalogue of common metadata.
Interface w/ other communities: check who is producing, as a "side-product," essential LRs for HLT: publishers, broadcast companies, etc. Public data.
"Research Fair Act" a la European Union -- look this up.
Practical Considerations in Resource Creation Tied to Human Language Technology Development
Christopher Cieri (University of Pennsylvania - LDC, USA)
GALE: transcription, translation, and distillation
LCTL: language packs for IE and translation in 1-2 dozen languages (LCTL = less commonly taught languages)
LVDID: train & test material for SRE
(this was an interesting talk -- see slides, because I didn't take many notes.)
Tradeoffs, such as quality vs quantity, may change as the amount of data changes, as the quality of the tools changes, etc.
It's important to give away tools and specs -- this allows good outsourcing.
When a corpus is donated to LDC, distribution is never exclusively via LDC. Distribution via multiple sites is a good thing.
An African Perspective on Language Resources and Technologies
Justus Roux (University of Stellenbosch, S. Africa)
Interest in HLT development in Africa
- LR & LT
- Renewed linguistic & cultural awareness of indigenous languages
Which languages should resources be developed for?
- Colonial languages
- Indigenous languages
Role of national governments with respect to local languages
Role of the private sector?
- Companies appreciate the utility of supporting local languages -- competitive edge
Awareness campaigns.
Support development of LTs/LRs for "African" varieties of European languages.
In Africa, speech-based systems are a priority: high illiteracy rates, limited internet penetration, and high penetration (~50%) of cell phone services -- as high as 95% in some countries.
Coverage of What? – Gaps in What? On De-globalising Human Language Resources
Dafydd Gibbon (Universität Bielefeld, DE)
Scientific responsibility
We generally represent...
- our own interests, and the interests of our groups
- but also the interests of larger political entities in which we live
But our competitive funding systems are:
- exclusive in that they create temporary research islands
- inclusive in creating sustainable research infrastructures
Cooperative instruments are also needed.
Coverage:
- Implies advancement in the future of domains, methods, applications
Gap:
- Implies a failure in the past: a small, repairable omission; reference to some Platonic ideal of completeness.
But our coverage is a drop in the ocean, so "gap" isn't really an appropriate term -- the gap is huge.
De-globalization: concentration on & respect for less mainstream languages.
Why de-globalization?
- Knowledge of complex forms & functions of various languages
- Digital divide: the technology gap, and its special case the digital divide, is taken to be asymmetrical
Sponsorship of underprivileged colleagues/groups. Vicious circle: colleagues can't afford contact with competitive funding and publishing conventions, and therefore can't meet the standards.
What does a BLARKette cost? Meetings are the engine, but funding is necessary.
A Dynamic View of Comparable and Specialized Corpora
Pierre Zweigenbaum (LIMSI-CNRS, FR)
- A corpus is not a random heap of texts.
- A corpus is organized
Comparable corpora
- bitexts and parallel corpora. Useful for MT. Limited in number and variety.
- c.f. "comparable corpora," which are similar according to a set of dimensions (topics, genres, etc). Much easier to get, but they're not direct translations of one another. They don't pre-exist: they need to be selected and paired.
- Range: parallel corpora, noisy parallel corpora, comparable corpora.
- Need a way to measure "comparability". Some measures have been proposed. Should be useful for evaluating, but also for generating, comparable corpora (a toy illustration follows below).
- There are also comparable corpora within a single language
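As a toy illustration of the kind of measure meant here (my own sketch, not one of the measures proposed in the talk): cosine similarity between the term-frequency profiles of two tokenized corpora. A cross-language version would first map terms through a bilingual dictionary.

```python
import math
from collections import Counter

def comparability(corpus_a, corpus_b):
    """Cosine similarity between the term-frequency profiles of two
    tokenized corpora -- a crude stand-in for a comparability measure."""
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    norm = math.sqrt(sum(c * c for c in fa.values())) * \
           math.sqrt(sum(c * c for c in fb.values()))
    return dot / norm if norm else 0.0

# Toy monolingual example; real corpora would be full token lists.
print(comparability("the cat sat on the mat".split(),
                    "a cat and a dog sat down".split()))
```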
Specialized corpora
Activities in specialized domains: sublanguage
- terms for terminology
- relations -> structure terminologies and ontologies
- specialized multilingual corpora for translation
Examples: MEDLINE, JRC Acquis (EU law)
Specifying a specialized corpus:
- topic
- task
- intended audience
- etc
Multiple dimensions. Hard to foresee all needs. Can't design and build all specialized corpora that we want in advance.
Evolution of knowledge: terminology evolves with time, and the value of knowledge depends on currency -- so specialized corpora generally need to evolve too: they need to be updated. Also need on-demand selection of subsets of the texts, carefully chosen according to needs.
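A minimal sketch of such on-demand sub-corpus selection, assuming documents carry metadata (the fields and values are invented for illustration):

```python
# On-demand selection of a specialized sub-corpus by metadata filter
# (hypothetical fields and values; real corpora would be much larger).
documents = [
    {"topic": "cardiology", "genre": "abstract", "year": 2008, "text": "..."},
    {"topic": "oncology",   "genre": "abstract", "year": 2001, "text": "..."},
]

subcorpus = [d for d in documents
             if d["topic"] == "cardiology" and d["year"] >= 2005]
print(len(subcorpus), "documents selected")
```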
Methods are needed to measure and characterize dimensions of specialization
Are any corpora not specialized?
Access issues
- legal
- privacy
Technology for Processing Non-verbal Information in Speech
Nick Campbell (Trinity College Dublin, IRL & NIST, JP)
Speech is action & interaction.
Current speech technology is founded on text.
There's a mismatch between the expectations of the systems and the performance of their users.
Talk in social interaction involves propositional content, but also other information channels.
Systems that process human speech need to be able to interpret the underlying speech acts: it's not enough to know what the person is saying; we need to know what they're doing (in the context of the conversation). A lot of communication comes from nonverbal signals, including affective speech sounds such as laughs, feedback noises, and grunts. These constitute a small, finite set of highly variable sounds in which most of the information is carried by prosody and tone of voice.
Discussion
Discussants:
- Adam Przepiórkowski (Polish Academy of Sciences - ICS, PL)
- Marko Tadić (University of Zagreb - FHSS - DL, HR)
- Kepa Sarasola Gabiola (University of the Basque Country - IXA Group, SP)
- Folkert de Vriend (Nederlandse Taalunie, NL-BE)
Session 2: Automatic and Innovative Means of Acquisition, Annotation, Indexing
Methods & models for building, validating, and maintaining raw & annotated LRs
Questions/parameters: required volume, coverage, manual vs automatic vs semiautomatic, standards/formats, language (in)dependence, performance, cost, time
Primary language data are abundant on the web for many languages, and for an increasing number of languages.
Web data contains "ill-formed" text communication, as well as images, videos, etc.
Some of the data on the web are basically annotations already -- e.g., summaries, transcripts/subtitles, image captions, opinions, etc.
For this session:
- current methods & scale of use
- suitability, success stories, best practices
- missing, new resources; future targets; priorities
- manual vs automatic
- data volume vs quality tradeoff
- well formed vs "ill-formed" language
- LR needs of areas related to language (multimedia, cognitive resources, etc)
- social networking
Rich Annotations of Text and Community Annotation
Jun'ichi Tsujii (University of Manchester - NacTeM, UK)
MEDLINE: 2000 abstracts, ~500k words
Linguistic annotation: tokenization, pos tagging, parse tree, dependency, deep syntax, coref
Ontology-based annotation: NER, RR, Event Recognizer, Pathway Constructor
Community annotation: have biologists annotate pathways.
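For a sense of how such layered annotations over the same text can be represented, here is a toy standoff-annotation sketch (my own format, not the NacTeM/GENIA schema): each layer annotates character spans of a shared base text.

```python
# Toy standoff annotation: every layer refers to (start, end) character
# spans of the same base text, so layers can be added independently.
text = "IL-2 activates NF-kappaB."

layers = {
    "tokens":   [(0, 4), (5, 14), (15, 24), (24, 25)],
    "pos":      [((0, 4), "NN"), ((5, 14), "VBZ"), ((15, 24), "NN")],
    "entities": [((0, 4), "protein"), ((15, 24), "protein")],
    "events":   [((5, 14), "positive_regulation",
                  {"cause": (0, 4), "theme": (15, 24)})],
}

for start, end in layers["tokens"]:
    print(text[start:end])
```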
LT applications vs LRs
Yorick Wilks (University of Sheffield, UK)
Dialogue corpora remain a problem
Trends in Language Resources and New Work in ASR Data Labeling
Gary Strong (Johns Hopkins University - HLT Center of Excellence, USA)
Trends:
- Hand annotation to semi-supervised annotation
- Corpus-based annotation to non-stationary stream processing, and adaptation to domain/genre changes
- Moving from small datasets to effectively infinite streams
Bootstrapping
Going for a Hunt? Don’t Forget the Bullets!
Dan Ioan Tufis (RACAI, RO)
Misconceptions about language resources: ML can do everything with lots of raw data; human expertise is less and less needed.
- partially true, but with accurate annotations the data hunger is much lower and the quality of services is increased
- minimal set of pre-processing steps (BLARK-like resources)
The quality of LRs comes from the accuracy of their linguistic annotations.
- BLARK-like resources & tools
- Better scenario: start with very clean annotated data, and bootstrap. But make sure the expert is in the development chain, to validate and correct annotations.
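A minimal sketch of that bootstrapping scenario, with the expert kept in the loop (the callables `train`, `predict`, and `expert_review`, and the confidence threshold, are assumptions of mine, not from the talk):

```python
# Sketch of annotation bootstrapping with an expert in the loop.
# `train`, `predict`, and `expert_review` are hypothetical callables
# supplied by the user; the threshold is illustrative only.
def bootstrap(seed_annotated, unlabeled, train, predict, expert_review,
              rounds=5, confidence_threshold=0.95):
    """Start from a small, clean annotated seed set; in each round,
    auto-annotate the unlabeled data, keep only confident predictions,
    and have the expert validate/correct them before retraining."""
    annotated = list(seed_annotated)
    for _ in range(rounds):
        model = train(annotated)
        predictions = [(x, *predict(model, x)) for x in unlabeled]
        confident = [(x, label) for x, label, prob in predictions
                     if prob >= confidence_threshold]
        # The expert stays in the development chain: every automatic
        # annotation is validated or corrected before it is reused.
        annotated.extend(expert_review(confident))
    return train(annotated)
```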
Automatic Lexical Acquisition - Bridging Research and Practice
Anna Korhonen (University of Cambridge, UK)
The Democratisation of Language Resources
Gregory Grefenstette (Exalead, FR)
Advocates building a simple tabular lexicon for each language, incl. word form, root, POS tag, frequency, and a simple translation (see the sketch below).
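A sketch of such a tabular lexicon (the columns follow the notes above; the French example rows are invented):

```python
# A simple tabular lexicon: one row per word form, written as TSV.
# Columns: word form, root/lemma, POS tag, frequency, simple translation.
lexicon = [
    ("chats",  "chat",   "NOUN", 10234, "cats"),
    ("mange",  "manger", "VERB",  8712, "eats"),
    ("rapide", "rapide", "ADJ",   5120, "fast"),
]

with open("fr_lexicon.tsv", "w", encoding="utf-8") as f:
    for row in lexicon:
        f.write("\t".join(map(str, row)) + "\n")
```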
Web3.0 and Language Resources
Marta Sabou (Open University, UK)
Exploiting Crowdsourced Language Resources for Natural Language Processing: 'Wikabularies' and the Like
Iryna Gurevych (Technische Universität Darmstadt - UKP Lab, DE)
Discussion
Discussants:
- Kiril Simov (Bulgarian Academy of Sciences - IPP - LML, BG)
- Sophia Ananiadou (University of Manchester - NacTeM, UK)
- Guy De Pauw (Universiteit Antwerpen, BE)
(at this point, I was fairly tired from jet lag, and stopped taking notes -- see the webpage for the position papers & slides for session 3)
Session 4: Interoperability and Standards
Introduction
"SILT: Towards Sustainable Interoperability for Language Technology"
James Pustejovsky (Brandeis University - DCS, USA) & Nancy Ide (Vassar College - DCS, USA)
"Interoperability, Standards and Open Advancement"
Eric Nyberg (Carnegie Mellon University, USA)
"Is the LRT Field Mature Enough for Standards?"
Peter Wittenburg (MPG, NL)
"Interoperability via Transforms"
Edward Loper (Brandeis University, USA)
"Ontology of Language Resource and Tools for Goal-oriented Functional Interoperability"
Key-Sun Choi (KAIST, KR)
"Towards Interoperability of Language Resources and Technologies (LRT) with Other Resources and Technologies"
Thierry Declerck (DFKI, DE)
Discussion
Discussants:
- Tomaž Erjavec (Jožef Stefan Institute, SI)
- Chu-Ren Huang (Hong Kong Polytechnic University, HK)
- Timo Honkela (Helsinki University of Technology - CIS, FI)
- Yohei Murakami (NICT, JP)
Session 5: Translation, Localisation, Multilingualism
Language Resources and Tools for Machine Translation
Hans Uszkoreit (DFKI, DE)
There's been progress in statistical MT, and in linguistic processing (parsing, morphology, generation, etc)
Less progress, but increased use, in rule-based MT.
Increasing use of hybrid systems, and system combination.
For SMT: no good solutions for non-local grammatical phenomena; and no good solutions for (lexical and syntactic) gaps in training data
For hybrid MT: lack of confidence estimation; lack of good solution for gaps in rules
- Insufficient classification of data with respect to its domain, etc
- Insufficient parallel texts, and insufficient coverage of different domains, etc
Outlook for Spoken Language Translation
Marcello Federico (FBK, IT)
Three Challenges for Localisation
Josef Van Genabith (Dublin City University - NCLT, IRL)
Assessing User Satisfaction with Embedded MT
Tony Hartley (University of Leeds, UK)
Institutional Translators and LRT
Josep Bonet-Heras (EC - DG Translation, LUX)
Language Technology in the European Parliament's Directorate General for Translation: Facts, Problems and Visions
Alexandros Poulis (EP - DG Translation - IT Support Unit, LUX)
'Cloud Sourcing' for the Translation Industry
Andrew Joscelyne (TAUS, FR)
Discussion
Discussants:
- Frank Van Eynde (Katholieke Universiteit Leuven - CCL, NL)
- Harold Somers (Dublin City University - SC, IRL)