Contributions from:

Belgium _ Czech Republic _ France _ Italy _ Poland _ Portugal _ Slovakia _ Spain _ UK _ USA

WELCOME to the
Language Resources for Public Security Workshop (LRPS 2012)
at LREC 2012

May 27, 2012, Istanbul, Turkey
 

   

INTRODUCTION

Dear Colleagues,

Public security in Europe and around the world faces several threats. These include threats arising from intentional human activities such as terrorism, spontaneous risks related to the uncontrolled behavior of individuals at mass events, natural disasters, etc. Combating these dangers creates challenges for information and communication technologies, which in many cases directly involve various forms of natural language processing. Gathering, maintaining and processing language resources specific to security applications is of primary importance for the language technologies concerned. In some cases it is useful to investigate and use sensitive linguistic data, which raises technological and legal problems connected with privacy, ownership, civic rights protection, etc.

The workshop is intended to serve as a thematic discussion forum open to:
- language resources suppliers,
- researchers and language engineers interested in the development of systems for security applications involving language technologies,
- potential/actual users of such systems,
- people concerned with legal aspects of gathering, maintenance and applications of language resources for public security purposes.

Long-term cooperation projects involving the workshop participants would be a welcome side effect of the workshop.

Zygmunt Vetulani and Edouard Geoffrois
LRPS 2012 Co-Chairs
contact: vetulani@amu.edu.pl


AREAS OF INTEREST

The workshop focuses on knowledge processing applications serving public security. Particular emphasis is given to the crucial role of language resources and related technologies. The discussion is open to, but not limited to, the following issues:

  • security specific corpora
  • security specific terminology
  • language models for specific sub-languages and language registers important for security research
  • language technology based tools to enhance public security
  • linguistic tools for risk assessment
  • controlled languages for public security applications
  • AI and NLP decision supporting systems
  • sharing and processing sensitive linguistic data
  • legal aspects of security-oriented natural language processing and engineering
  • access to sensitive data
  • IPR issues
  • protection and use of sensitive source data
  • international collaboration issues
  • issues related to national and international funding

This list is by no means closed and we remain open to other fascinating issues. Suggestions, ideas and observations which may be useful to prepare the discussion may be addressed directly to the LRPS co-chairs by email (vetulani@amu.edu.pl).


PROGRAM COMMITTEE

Zygmunt Vetulani (Adam Mickiewicz University, Poznań, Poland) - co-chair
Edouard Geoffrois (Direction Générale de l'Armement, Paris, France) - co-chair
 
Laura Chaubard (Direction Générale de l'Armement, Paris, France)
Jakub Gorczyński (Polish National Police, Poznań, Poland)
Fryni Kakoyianni-Doa (University of Cyprus, Nicosia, Cyprus)
Nasrullah Memon (University of Southern Denmark, Odense, Denmark)
Mario Montoleone (Salerno University, Italy)
Karel Pala (Masaryk University, Czech Republic)
Frederique Segond (Object Direct, Grenoble, France)
Tadeusz Tomaszewski (University of Warsaw, Poland)

ORGANIZING COMMITTEE

Zygmunt Vetulani (Adam Mickiewicz University, Poznań, Poland) - co-chair
Edouard Geoffrois (Direction Générale de l'Armement, Paris, France) - co-chair
Wojciech Czarnecki - secretary

LANGUAGE

The LRPS workshop language is English.



PRESENTATIONS

In keeping with LREC tradition, the presentation method (oral or poster) does not reflect paper quality.

Three papers (focusing on resources) will be presented during the oral session and the remaining eight (more application-oriented) at the poster session. The poster session will be chaired.

The allowed poster size is A0, vertical.


DATE AND PLACE

  • Workshop date: May 27, 2012 at 14:00
  • Workshop place: The Lütfi Kirdar Istanbul Exhibition and Congress Centre
  • More information at the LREC 2012 home page


FEES, PAYMENT, REGISTRATION

Registration and payment: as described at the LREC 2012 home page


WORKSHOP PROGRAM

14:00 – 14:20 – Opening and introductory presentation by Zygmunt Vetulani and Edouard Geoffrois
14:20 – 15:00 – Invited Keynote Talk by Chris Cieri
15:00 – 16:00 – Resources (oral presentations)
16:00 – 16:30 – Coffee break
16:30 – 17:15 – Resources in Public Security Applications
17:15 – 18:00 – General discussion moderated by Frederique Segond

20:00 – 23:00 – Dinner (informal) /not included in workshop fees/


Sunday, May 27, 2012
The Lütfi Kirdar Istanbul Exhibition and Congress Centre

14:00 - 14:20

Opening and introductory presentation by Zygmunt Vetulani and Edouard Geoffrois

Zygmunt Vetulani, Edouard Geoffrois, Wojciech Czarnecki and Bartłomiej Kochanowski, Language Resources for Public Security Applications: Needs and Specificities [Abstract]

14:20 - 15:00

Invited Keynote Talk by Chris Cieri (University of Pennsylvania, USA, Linguistic Data Consortium, Executive Director)

Chris Cieri, Language Resources for Public Security Applications: a Data Center Perspective [Abstract]

15:00 - 16:00

Resources (oral presentations)

Adam Dąbrowski, Szymon Drgas, Paweł Pawłowski and Julian Balcerek, Development of PUEPS - corpus of emergency telephone conversations [Abstract]

Irina Temnikova and K. Bretonnel Cohen, The Crisis Management Corpus and its Application to the Study of the Crisis Management Sub-language [Abstract]

Christian Fluhr, Aurélie Rossi, Louise Boucheseche and Fadhela Kerdjoudj, Extraction of information on activities of persons suspected of illegal activities from web open sources [Abstract]

16:00 - 16:30

Coffee Break

16:30 - 17:15

Resources in Public Security Applications (poster presentations)

Carlo Aliprandi, Tomas By and Sérgio Paulo, Language Processing and Linguistic Data in the CAPER Project [Abstract]
Richard Beaufort, Alexander Panchenko and Cédrick Fairon, Detection of Child Sexual Abuse Media on P2P Networks: Normalization and Classification of Associated Filenames [Abstract]
Simona Cantarella, Carlo Ferigato and Evans Boateng Owusu, Design of a Controlled Language for Critical Infrastructures Protection [Abstract]
Ales Horak, Karel Pala and Jan Rygl, Authorship Identification to Improve Public Security [Abstract]
Wiesław Lubaszewski and Michał Korzycki, Unexpected Factual Associations Mining [Abstract]
Miriam R L Petruck and Gerard de Melo, Precedes: A Semantic Relation in FrameNet [Abstract]
Milan Rusko, Sakhia Darjaa, Marian Trnka and Miloš Cerňak, Expressive speech synthesis database for emergent messages and warnings generation in critical situations [Abstract]
Zygmunt Vetulani, Language Resources in a Public Security Application with Text Understanding Competence. A Case Study: POLINT-112-SMS [Abstract]

17:15 - 18:00

General discussion (moderated by Frédérique Segond)

20:00 - ...

Dinner (informal) /not included in workshop fees/




ABSTRACTS

Language Resources for Public Security Applications: Needs and Specificities
Zygmunt Vetulani, Edouard Geoffrois, Wojciech Czarnecki and Bartłomiej Kochanowski

Abstract:
Language technologies and the associated language resources necessary to develop them are needed in a number of applications in the public security sector, and there is a growing demand for such applications. The paper illustrates the scope and importance of the needs by presenting various examples of applications along with the corresponding language technologies and language resources. However, collecting and sharing these resources can be especially difficult in that sector due to its specificities. The paper proposes to better identify and acknowledge these specificities in order to better address them and suggests that sharing experience across the various applications within the sector might help to overcome the difficulties.



Language Resources for Public Security Applications: a Data Center Perspective
Chris Cieri

Abstract:
Among the many corpora that LDC is producing or distributing, several, for example some of the Mixer corpora, are related to public security variously defined. In this talk we present some of these corpora and how they were created. We also describe some of the issues encountered in their creation which are related to the public security domain, how we overcame them and the lessons learned. Some specific issues we will discuss include matching data specifications to rapidly evolving requirements, managing intellectual property, protecting the privacy of human subjects and distributing resulting data.



Development of PUEPS - corpus of emergency telephone conversations
Adam Dąbrowski, Szymon Drgas, Paweł Pawłowski and Julian Balcerek

Abstract:
This article describes the development of the PUEPS corpus. The dataset contains recordings of acted emergency telephone conversations. Speakers who participated in the experiments reported crime scenes that were presented to them in the form of previously prepared movies. Recording sessions were performed in laboratory conditions. Metadata summarizing information about the speaker, the conversation, and the reported event were added to each conversation. Moreover, manually prepared transcriptions enriched with tags describing paralinguistic phenomena are also part of the corpus. These transcriptions were made using tools prepared by the authors, which make the work fast and convenient thanks to prompting, annotation, and data management mechanisms. The transcription experiments showed a substantial improvement in work efficiency and speed. Final multilevel speaker recognition experiments showed that the accuracy of speaker recognition is noticeably improved by the use of transcriptions and linguistic-level analysis.



The Crisis Management Corpus and its Application to the Study of the Crisis Management Sub-language
Irina Temnikova and K. Bretonnel Cohen

Abstract:
This article presents a novel language resource, the Crisis Management Corpus (CMC). The corpus is the first in its domain and is expected to be of utility for linguistic studies and for natural language processing applications in the crisis management and the public security domains. The article describes the collection, pre-processing and composition of this resource, along with its possible applications. Two example applications of the resource are described in detail. The first application is the study of the text complexity levels characterizing the CMC, with the aim of evaluating the communicative efficiency of written documents in the domain. The second application is a preliminary investigation of the linguistic characteristics of the crisis management sub-language.



Extraction of information on activities of persons suspected of illegal activities from web open sources
Christian Fluhr, Aurélie Rossi, Louise Boucheseche and Fadhela Kerdjoudj

Abstract:
This work is part of the French-funded SAIMSI project (Suivi Adaptatif Interlingue et Multisource des Informations). The aim of the project is to follow the activities of persons suspected of illegal actions such as terrorism, drug trafficking or money laundering. The paper focuses in particular on information extraction, which is performed in French, English, Arabic and Chinese. The information extraction is based on a deep morphosyntactic analysis. Single words, idiomatic expressions and compounds are recognized, and named entities are identified and categorized. Dependency relations are built, and passive/active forms, negation, anaphora and verb tenses are processed. Information extraction is application-independent and uses extraction rules. At this level some named entity categories can be reconsidered. The extraction is based on a large security ontology. The paper details the problems of consolidating the extracted knowledge at the document level. The future evaluation on WEPS-3 data is presented.



Language Processing and Linguistic Data in the CAPER Project
Carlo Aliprandi, Tomas By and Sérgio Paulo

Abstract:
Much information of potential relevance to police investigations of organised crime is available in public sources without being recognised and used. Barriers to the simple and efficient exploitation of this information include the fact that not everything is easily searchable and that it may be written in a language other than that of the investigator. To help overcome these problems, the CAPER project aims to create an integrated platform for acquisition, processing, and analysis of information in multiple languages, and also to link this to legacy police IT systems. Full Natural Language Processing pipelines for multiple languages and media are used to map persons and organisations to actions and events, and multilingual lexicons and gazetteers allow cross-lingual search in the indexed data. Domain-specific lexicons contain words and slang expressions with special senses in the context of organised crime. The system supports multilingual analysis of unstructured and audiovisual contents, based on text mining for fourteen languages, and uses language-neutral interfaces, so that the addition of further languages will not require any modification of existing components.



Detection of Child Sexual Abuse Media on P2P Networks: Normalization and Classification of Associated Filenames
Richard Beaufort, Alexander Panchenko and Cédrick Fairon

Abstract:
The goal of the iCOP project is to build a system for detecting the originators of pedophile content on P2P networks such as BitTorrent, eDonkey, or Kad. This paper outlines the key functions of language processing in the iCOP system. Next, we describe the architecture of the language analysis module and its key components: filename classifier, term extractor, and filename normalizer. The language resources used in each component are discussed. The paper also presents the first experiments with the module on standard pornography data (used in the preliminary tests as a substitute for child pornography data). Our results show that the module is able to separate titles of pornographic galleries and videos from titles of encyclopaedia articles with accuracy up to 97%. Finally, we discuss directions for future research and development of the iCOP language analysis module.



Design of a Controlled Language for Critical Infrastructures Protection
Simona Cantarella, Carlo Ferigato and Evans Boateng Owusu

Abstract:
We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates from the need to coordinate and categorize the communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science known as the Euratom Thesaurus.



Authorship Identification to Improve Public Security
Ales Horak, Karel Pala and Jan Rygl

Abstract:
In the paper, we present details of a new project aimed at automatic web document analysis for the purpose of authorship attribution based on various stylistic and grammatical features of the text. We describe the corresponding system modules with their expected functionality and provide examples of text processing and evaluating techniques.



Unexpected Factual Associations Mining
Wiesław Lubaszewski and Michał Korzycki

Abstract:
The paper describes the LSA (Latent Semantic Analysis) algorithm as a tool for mining unexpected factual associations from text corpora. Because LSA performs well on text corpora built from short texts, it can be a useful tool for analysing e-mails stored in a mailbox, chat logs or Internet forum content. LSA may therefore serve as a tool in forensic or security analysis.



Precedes: A Semantic Relation in FrameNet
Miriam R L Petruck and Gerard de Melo

Abstract:
Automatic language processing systems depend on, among other factors, the effectiveness in modeling human cognitive abilities, including the capacity to draw inferences about prototypical or expected sequences of events and their temporal order. Appropriate response to a crisis is as important for public security as are efforts to prevent any such natural or man-made disaster. Recent research (Mehrota et al. 2008) has recognized the need for accurate and actionable situation awareness during emergencies, where timely status updates are critical for effective crisis management. The present paper constitutes a contribution to situation awareness for Natural Language Processing (NLP) applications to improve communication among first responders, and features the frame-to-frame semantic relation Precedes, as implemented in FrameNet (http://framenet.icsi.berkeley.edu). Specifically, this work demonstrates the necessity and importance of the information encoded with Precedes for NLP applications, advocating the inclusion of such information in systems for security applications.



Expressive speech synthesis database for emergent messages and warnings generation in critical situations
Milan Rusko, Sakhia Darjaa, Marian Trnka and Miloš Cerňak

Abstract:
Automatic information and warning systems can be used to inform, warn, instruct and navigate people in dangerous and critical situations, and increase the effectiveness of crisis management and rescue operations. One of the activities in the frame of the EU SF project CRISIS is called “Extremely expressive (hyper-expressive) speech synthesis for urgent warning messages generation”. It is aimed at research and development enabling the possibility to design speech synthesizers with high naturalness and intelligibility in Slovak which will be capable of generating messages with various expressive loads. The synthesizer will be applicable to generate warning system messages in case of fire, flood, state security threats, etc. Early warning in relation to the above can be made thanks to fire and flood spread forecasting; modeling thereof is covered by other activities of the CRISIS project. The most important part needed for synthesizer building is the speech database. A method is proposed to create such a database. The first version of the expressive speech database is introduced and first experiments with expressive synthesizers trained with this database are discussed.



Language Resources in a Public Security Application with Text Understanding Competence. A Case Study: POLINT-112-SMS
Zygmunt Vetulani

Abstract:
The aim of this paper is to show the importance of language resources in the development of complex, public security oriented applications with natural language understanding components as essential parts of the system. We present a case study of a mature project in the public security sector. This case study aims to give an idea of the spectrum of needs and problems, without claiming to exhaust the topic. As is typical for public security oriented projects, besides the usual problems due to gaps in the available language data (resources), designers and developers of the presented system needed to deal with the sensitive data necessary for efficient language modeling. To make the paper self-contained, we start with a compact presentation of the POLINT-112-SMS system. Then we present the language resources we used.

AUTHOR INDEX

Aliprandi, Carlo
Balcerek, Julian
Beaufort, Richard
Boucheseche, Louise
By, Tomas
Cantarella, Simona
Cerňak, Miloš
Cieri, Chris
Cohen, K. Bretonnel
Czarnecki, Wojciech
Darjaa, Sakhia
Dąbrowski, Adam
de Melo, Gerard
Drgas, Szymon
Fairon, Cédrick
Ferigato, Carlo
Fluhr, Christian
Geoffrois, Edouard
Horak, Ales
Kerdjoudj, Fadhela
Kochanowski, Bartłomiej
Korzycki, Michał
Lubaszewski, Wiesław
Owusu, Evans Boateng
Pala, Karel
Panchenko, Alexander
Paulo, Sérgio
Pawłowski, Paweł
Petruck, Miriam R. L.
Rusko, Milan
Rossi, Aurélie
Rygl, Jan
Temnikova, Irina
Trnka, Marian
Vetulani, Zygmunt
Vetulani, Zygmunt et al.


 