Main topic:
Exploiting literature data
(alongside reports on current research at the institute)
Time/Place: Roughly every two weeks on Tuesdays, 10:15-12:00, in room BIN 2.A.10.
Lecturers: Dietrich Rebholz-Schuhmann, Michael Hess, Martin Volk
Contact: Dietrich Rebholz-Schuhmann
The Mantra project addressed solutions for improving entity recognition (ER) in parallel multilingual document collections. Large sets of documents in different languages, i.e. Medline titles, EMEA drug label documents and patent claims, have been prepared to enable ER in parallel documents. Each set of documents forms a corpus-language pair (CLP), and the number of documents per CLP varies from about 120,000 for patents up to 760,000 for Medline abstract titles. The documents (in different languages) have been processed with annotation solutions, and the annotations have been used to generate silver standard corpora (SSCs). With the help of a gold standard corpus (GSC, available in several languages), the SSC generation has been optimised to achieve the best possible results. The gap between the SSCs and the GSC will form the core of this presentation.
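The harmonisation step behind a silver standard can be pictured as simple span voting: each annotation solution proposes entity spans, and a span enters the silver standard once enough systems agree on it. This is only a minimal sketch of the idea; the vote threshold, span types, and example data below are invented, and the actual Mantra procedure was tuned against the gold standard.

```python
from collections import Counter

def silver_standard(annotations, min_votes=2):
    """Merge entity annotations from several systems into a silver
    standard: keep each (start, end, type) span that at least
    `min_votes` systems agree on."""
    votes = Counter()
    for system_spans in annotations:
        for span in set(system_spans):  # each system votes once per span
            votes[span] += 1
    return sorted(span for span, n in votes.items() if n >= min_votes)

# Three hypothetical annotation systems over one Medline title
a = [(0, 7, "DISO"), (12, 19, "CHEM")]
b = [(0, 7, "DISO")]
c = [(0, 7, "DISO"), (12, 19, "CHEM"), (25, 30, "GENE")]
print(silver_standard([a, b, c], min_votes=2))
# → [(0, 7, 'DISO'), (12, 19, 'CHEM')]
```

Raising `min_votes` trades recall for precision, which is exactly the knob one would optimise against a gold standard.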
Dietrich Rebholz-Schuhmann is a medical doctor and computer scientist with a PhD in immunology. He headed a research team at LION Bioscience AG (Heidelberg, 1998-2003) and at the European Bioinformatics Institute in Hinxton (UK, 2003-2012). At the Department of Computational Linguistics he coordinates the EU-MANTRA project (2012-2014).
Don will present a hybrid pronoun resolution system for German. It uses a rule-based entity-mention formalism to incrementally process discourse entities, while antecedent selection is performed with Markov Logic Networks (MLNs). The hybrid architecture yields a neat problem formulation in the MLNs with respect to efficient inference complexity while retaining their full expressiveness. The system is compared to a rule-driven baseline and to an extension that uses a memory-based learner; the MLN hybrid outperforms both competitors by a significant margin.
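The MLN inference itself is beyond a short sketch, but the rule-driven baseline idea - incrementally collected discourse entities filtered by hard agreement constraints, with the most recent compatible mention chosen as antecedent - can be illustrated as follows. All feature names and example values are invented:

```python
def compatible(pronoun, mention):
    """Hard agreement constraints: a German pronoun and an antecedent
    candidate must match in gender and number."""
    return (pronoun["gender"] == mention["gender"]
            and pronoun["number"] == mention["number"])

def resolve(pronoun, discourse_entities):
    """Rule-based baseline: pick the most recent compatible mention
    from the incrementally built list of discourse entities."""
    for mention in reversed(discourse_entities):
        if compatible(pronoun, mention):
            return mention
    return None

entities = [
    {"text": "der Mann", "gender": "masc", "number": "sg"},
    {"text": "die Frau", "gender": "fem", "number": "sg"},
]
er = {"text": "er", "gender": "masc", "number": "sg"}
print(resolve(er, entities)["text"])  # → der Mann
```

In the hybrid system, such hard constraints prune the candidate space so the MLN only has to rank the remaining compatible antecedents.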
Simone will demonstrate how the recognition of rhetorical structure and argumentation in scientific articles is a useful and achievable task, one that could potentially advance text understanding and can be exploited in many applications. She describes Argumentative Zoning, a method for shallowly structuring a text into zones according to speech-act status, and explains the theory behind it. The project FUSE, which aims to detect new, emerging ideas in entire scientific disciplines, serves as an example. Simone will describe the natural-language-based features - in particular, the rhetorical ones - that help in the prediction task. The talk concludes with current work on tracing citations in a newly created corpus on the topic of RNAi. The idea here is to quantify and correlate a cited work's impact and status with various factors related to the rhetorical structure of the citing text, such as where in the text it is cited, and exactly how.
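A minimal sketch of how cue-phrase features can assign zone labels to sentences. The zone labels follow the Argumentative Zoning scheme only loosely, and the cue lists are invented for illustration; a real zoner would combine many such features in a trained classifier:

```python
# Map zone labels to (invented) surface cue phrases.
CUES = {
    "AIM": ["in this paper we", "our goal is"],
    "CTR": ["however", "in contrast", "unlike"],
    "BKG": ["it is well known", "traditionally"],
}

def zone(sentence):
    """Assign a shallow rhetorical zone by cue-phrase lookup."""
    s = sentence.lower()
    for label, cues in CUES.items():
        if any(cue in s for cue in cues):
            return label
    return "OWN"   # default: the authors' own work

print(zone("In this paper we present a new tagger."))    # → AIM
print(zone("However, earlier systems ignored context."))  # → CTR
```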
Martin holds a major in communication science and a minor in computational linguistics. He is currently pursuing his PhD at the IPMZ (Institut für Publizistikwissenschaft und Medienforschung). His talk will focus on "Angrist 1.2", a Python tool that produces query forms for relational data input in content analysis. It is especially useful for applying hierarchical codebooks, as it allows data entry at different levels of the analysis without increasing the cognitive load on coders.
Nadine will give an introduction to how instances of named entities (e.g. persons, titles) are represented in the Text+Berg corpus and how her solution for improved NER works. An evaluation will be presented as well.
Automatic methods for the analysis of biomedical texts have matured considerably over the last 15 years.
As tools for basic tasks have become established, the focus of efforts in domain information extraction
has turned toward new challenges, such as detailed, ontology-based recognition and normalization of physical
entity mentions and complex processes involving multiple entities in a variety of roles.
This talk will present these and related trends in the context of the BioNLP Shared Task (BioNLP ST)
series of events, focusing on the Cancer Genetics and Pathway Curation event extraction tasks of BioNLP ST 2013.
Following an introduction to the event extraction task setting and representation, I will discuss the state
of the art in extraction methods and present manually annotated resources, available tools, and databases of analysis results.
Current applications of the extraction technology such as semantic search and curation support tools will be introduced,
with emphasis on remaining opportunities. Future directions for event extraction will be discussed with the theme of
"scaling up": from few specific entity and event types to hundreds; from the molecular scale to higher levels of
biological organization; and from small challenge datasets to analyses of millions of documents and databases
encompassing the entire available domain literature.
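Event annotations in the BioNLP ST setting are distributed as standoff files, where T-lines mark text-bound triggers and entities and E-lines attach role-labelled arguments to a trigger. A minimal reader for that representation might look like this (the sample lines are invented, but follow the standoff conventions):

```python
def parse_standoff(lines):
    """Parse BioNLP ST standoff annotation lines into entities
    (text-bound spans) and events (trigger plus role:arg pairs)."""
    entities, events = {}, {}
    for line in lines:
        ann_id, rest = line.split("\t", 1)
        if ann_id.startswith("T"):            # text-bound annotation
            meta, text = rest.split("\t")
            etype, start, end = meta.split()
            entities[ann_id] = (etype, int(start), int(end), text)
        elif ann_id.startswith("E"):          # event annotation
            trigger, *args = rest.split()
            etype, trigger_id = trigger.split(":")
            events[ann_id] = (etype, trigger_id,
                              dict(a.split(":") for a in args))
    return entities, events

sample = [
    "T1\tGene_expression 10 20\texpression",
    "T2\tProtein 0 4\tTP53",
    "E1\tGene_expression:T1 Theme:T2",
]
entities, events = parse_standoff(sample)
print(events["E1"])  # → ('Gene_expression', 'T1', {'Theme': 'T2'})
```

The same nested representation (events taking other events as arguments) is what makes "scaling up" to complex multi-entity processes possible.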
Dr Sampo Pyysalo has been working on the development of resources and methods for biomedical information extraction
with particular focus on supervised machine learning approaches, structured knowledge representations,
and large-scale text mining.
He has initiated and participated in the design and development of several annotated biomedical corpora,
including BioNLP Shared Task and GENIA resources, leads the development of the open-source annotation tool BRAT,
and has contributed to the creation of automatic structured analyses spanning the entire available
biomedical literature through the development and deployment of text mining tools at the University of Tokyo,
the UK National Centre for Text Mining, and the University of Turku.
He has been a co-organizer of conferences, workshops, and challenges in this domain.
The term hybrid machine translation refers to any combination of statistical MT with rule-based MT or example-based MT, or a mixture of all three approaches. In this talk, a hybrid MT system for the language pair Spanish-Cuzco Quechua will be presented. The core of the system is a classical, rule-based pipeline. However, as not all ambiguities can be resolved efficiently by rules, the system relies on statistical models for certain tasks.
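The division of labour described above - deterministic rules first, a statistical model as fallback for ambiguities the rules cannot decide - can be sketched as follows. The lexicon, domain labels, and probabilities are invented toys; the real system targets Spanish-Cuzco Quechua:

```python
# Invented rule table and toy "statistical model" (sense probabilities).
RULES = {("banco", "FINANCE"): "bank-institution"}
MODEL = {"banco": {"bank-institution": 0.7, "bench": 0.3}}

def disambiguate(word, domain=None):
    """Hybrid lookup: a deterministic rule wins if one applies;
    otherwise fall back to the most probable sense; otherwise pass
    the word through unchanged."""
    if (word, domain) in RULES:
        return RULES[(word, domain)]
    senses = MODEL.get(word)
    if senses:
        return max(senses, key=senses.get)
    return word

print(disambiguate("banco", "FINANCE"))  # rule fires
print(disambiguate("banco"))             # statistical fallback decides
```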
Tilia will present the results of her ongoing work on extracting biomedical entities, and the relations between them, from the scientific literature. In addition to extracting chemical entities from text, she now focuses on the identification of genes (a.k.a. proteins). The entities are extracted based on a database of interactions between chemicals and genes (CTD, the Comparative Toxicogenomics Database) and with the help of scientific databases for proteins (UniProt, EntrezGene) and other semantic resources, such as ontologies (ChEBI, Chemical Entities of Biological Interest).
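Dictionary-based tagging against such resources can be sketched as follows: entity names drawn from databases like CTD, ChEBI, or UniProt are looked up in the text and annotated with their database accessions. The two-entry lexicon below is invented for illustration (the accessions shown are the real identifiers for aspirin and TP53):

```python
# Mini-lexicon of entity names with type and database accession.
LEXICON = {
    "aspirin": ("CHEM", "CHEBI:15365"),
    "tp53": ("GENE", "UniProt:P04637"),
}

def tag(text):
    """Case-insensitive dictionary lookup: return (start, end, type,
    accession) for each lexicon entry found in the text."""
    hits, lowered = [], text.lower()
    for name, (etype, accession) in LEXICON.items():
        start = lowered.find(name)
        if start >= 0:
            hits.append((start, start + len(name), etype, accession))
    return sorted(hits)

print(tag("Aspirin modulates TP53 activity."))
# → [(0, 7, 'CHEM', 'CHEBI:15365'), (18, 22, 'GENE', 'UniProt:P04637')]
```

Real pipelines add normalisation of spelling variants and disambiguation of ambiguous names, which is where most of the difficulty lies.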
Tobias holds a master's degree in computational linguistics from the University of Zurich and
is currently pursuing a PhD at the Institute of Computational Linguistics of the University of Zurich.
The talk will give an introduction to previous efforts to create a multi-parallel corpus
database for storing, combining and querying several layers of annotations and alignments in
order to answer linguistic questions empirically.
Emphasis will be put on the cleaning and turn alignment of the Europarl corpus,
which is at the core of our database.
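The central data structure of such a database can be pictured as an alignment table: a sentence identifier in one language maps to its counterparts in the others, so queries can combine annotation layers across languages. The table contents below are invented for illustration:

```python
# Toy multi-parallel alignment table: (language, sentence id) pairs
# mapped to the aligned sentence ids in the other languages.
alignments = {
    ("de", 17): {"en": 21, "fr": 19},
    ("de", 18): {"en": 22, "fr": 20},
}
sentences = {
    ("en", 21): "The debate is closed.",
    ("de", 17): "Die Aussprache ist geschlossen.",
}

def aligned(lang, sid, target):
    """Return the target-language sentence aligned with (lang, sid),
    or None if its text is not stored."""
    tid = alignments[(lang, sid)][target]
    return sentences.get((target, tid))

print(aligned("de", 17, "en"))  # → The debate is closed.
```

In practice this lives in a relational database, and the hard part is producing clean turn-level alignments in the first place.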
Laura holds a diploma in computer software and a master in information technologies,
both from Universitat Politècnica de Catalunya (Barcelona). She has specialized in
Natural Language Processing, Software Engineering and Information Systems. Currently she
is pursuing her PhD in machine translation (at cl@UZH).
She will present her work on the consistent translation of German compound coreferences
using two in-domain phrase-based SMT systems. In contrast to most
Statistical Machine Translation (SMT) systems, which translate at the sentence level
and thus introduce inconsistencies across the document, she presents a method
to enforce consistency across the whole document. Her experimental results demonstrate
that the correctness and consistency of compound coreferences can be improved.
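The consistency idea can be caricatured as a translation cache keyed by coreference chain: the first translation of a compound is recorded, and later coreferent mentions (including shortened forms) reuse it instead of whatever the sentence-level system would pick. Everything below - lexicon, chains, sentences - is an invented toy, not the actual SMT systems:

```python
def translate_document(sentences, translate_word, coref_chains):
    """Translate sentence by sentence, but force coreferent mentions
    (words sharing a chain id) to reuse the first cached translation."""
    cache, output = {}, []
    for sent in sentences:
        out = []
        for word in sent:
            chain = coref_chains.get(word)
            if chain in cache:
                out.append(cache[chain])      # enforce earlier choice
            else:
                translated = translate_word(word)
                if chain is not None:
                    cache[chain] = translated
                out.append(translated)
        output.append(out)
    return output

lex = {"Bundesbahn": "federal railway", "Bahn": "train",
       "Die": "the", "die": "the"}
chains = {"Bundesbahn": 1, "Bahn": 1}   # same coreference chain
doc = [["Die", "Bundesbahn"], ["die", "Bahn"]]
print(translate_document(doc, lex.get, chains))
# the shortened coreferent "Bahn" inherits "federal railway"
```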
Tobias holds a master's degree in computer science from the University of Zurich and
a PhD in computer science from the Institute of Computational Linguistics of the University of Zurich.
He then held a number of postdoc positions at the universities of Chile, Zurich, Malta,
and Helsinki, at Yale (Prof. Krauthammer), at the SIB, and now at ETH Zurich (Chair of Sociology).
His research projects are concerned with computational linguistics, bioinformatics, simulation,
semantic web, social systems, controlled natural languages, and artificial intelligence.
This talk is about the automatic extraction of scientific concepts, i.e. memes, from large corpora of
scientific literature. The work shows that citation networks can provide powerful clues for interpreting large
quantities of scientific text, in particular for observing trends, tracking ideas, and detecting research fields.
Our technique has the potential to improve existing approaches to terminology extraction, named-entity extraction,
topic modeling, and keyphrase extraction, but it also performs remarkably well on its own.
We validated our simple meme formula with data from close to 50 million publication records from the Web of Science,
PubMed Central, and the American Physical Society. Evaluations relying on human annotators, network randomizations,
and comparisons with several alternative approaches confirm that our technique is accurate and effective, even without
including linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.
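One plausible reading of a meme-score-style statistic is term frequency multiplied by a propagation ratio: how often the term "sticks" (appears in papers that cite meme-carrying work) relative to how often it "sparks" without such a citation. The sketch below is a loose approximation with invented data and an assumed smoothing constant, not the exact published formula:

```python
def meme_score(papers, meme, delta=3):
    """Frequency times a sticking/sparking ratio, with additive
    smoothing (delta) in the denominators. Loose approximation only."""
    carries = {p["id"]: meme in p["terms"] for p in papers}
    stick_num = stick_den = spark_num = spark_den = 0
    for p in papers:
        cites_meme = any(carries.get(c, False) for c in p["cites"])
        if cites_meme:                       # could inherit the meme
            stick_den += 1
            stick_num += carries[p["id"]]
        else:                                # could only spark it anew
            spark_den += 1
            spark_num += carries[p["id"]]
    frequency = sum(carries.values()) / len(papers)
    sticking = stick_num / (stick_den + delta)
    sparking = spark_num / (spark_den + delta)
    return frequency * sticking / (sparking + 1e-9)

papers = [
    {"id": 1, "terms": {"graphene"}, "cites": []},
    {"id": 2, "terms": {"graphene"}, "cites": [1]},
    {"id": 3, "terms": set(), "cites": [1]},
    {"id": 4, "terms": set(), "cites": []},
]
print(meme_score(papers, "graphene"))
```

A high score means the term spreads along citation links far more readily than it appears spontaneously, which is the intuition behind treating it as a meme.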