A6 – Computational lexicography


EMLex offers a diverse range of teachers and lecturers from around the globe. This course will be taught by:

Prof. Dr. Stefan Evert

Friedrich Alexander University Erlangen-Nuremberg

Prof. Dr. Ulrich Heid

University of Hildesheim

Dr. Besim Kabashi

Friedrich Alexander University Erlangen-Nuremberg



Topics to be treated in this module include:

  1. Foundations of corpus linguistics
    • principles and methods of corpus analysis
    • applications of corpus data in lexicography
    • types of corpora, overview of existing corpora
    • corpus design, representativity, data sources, metadata
  2. Corpus compilation
    • building corpora from online data: web scraping etc.
    • boilerplate removal, normalization, metadata extraction
    • representation and exchange formats
    • online and stand-alone tools for web corpus compilation
    • automatic linguistic annotation (POS, lemma, NER, parsing, …)
    • online and stand-alone tools for linguistic annotation
  3. Searching corpora
    • regular expressions
    • character encodings and the Unicode standard
    • CQP query language for lexico-grammatical patterns
    • practical exercises with Sketch Engine and CQPweb
  4. Quantitative analysis
    • frequency lists and metadata distribution
    • collocations and word sketches
    • keyword analysis
    • lexicographic interpretation of results
    • foundations of statistical inference
  5. Reproducibility
    • research methodology and documentation
    • data management, sustainability of corpus resources

A version of the module description with further information is available.


General information:

Time frame: 22.03-26.03
Room: 00.4 PSG
Evaluation method: participation in a team project with a written report (the teams will be determined at the beginning of the module)
Teaching language: German or English


Information on the EMLex 2021 Summer school:

Practical arrangements: Well in advance of the course, participants will receive a syllabus, relevant literature and suggestions on how to prepare via the Moodle platform. The sessions are moderated by the lecturer and the guest lecturer. The lessons center on practical computer exercises carried out in small groups (instructions will be given beforehand).


Certificate: There are two ways to obtain an EMLex 2021 Summer school certificate: (a) without a grade, through active participation in practical exercises, class discussions and a team project; or (b) with a grade, through participation in a team project with a written report. The teams will be determined at the beginning of the course.



Timetable (Monday to Friday):

9:00-10:30
    • Tuesday: Getting data from the Web (Evert)
    • Wednesday: Corpus annotation: metadata and linguistic annotation (Evert/Heid)
    • Thursday: Basics of quantitative analysis: types and tokens, frequency, frequency distributions, keywords, term candidate extraction (Evert/Heid)
    • Friday: Student presentations and discussion: Projects I (all)

11:00-12:30
    • Monday: Welcome; lexicography and text corpora (Heid/Evert)
    • Tuesday: Exercises on crawling, web scraping etc. (Evert/Kabashi)
    • Wednesday: Corpus search: the CQP query language and Sketch Engine (Evert/Heid)
    • Thursday: Collocations (Evert/Heid)
    • Friday: Student presentations and discussion: Projects II (all)

Lunch break

2:00-3:30
    • Monday: Introduction to corpus linguistics; principles and challenges of web corpora (Evert/Heid)
    • Tuesday: Schierholz (A5)
    • Wednesday: Practice: Sketch Engine for annotation and query (Evert/Kabashi)
    • Thursday: Practice: quantitative data and Sketch Engine (Evert/Kabashi)
    • Friday: Final discussion round (all)

4:00-5:30
    • Monday: Experimentation time (Evert, Kabashi, Heid)
    • Tuesday: Schierholz (A5)
    • Wednesday: Group work on projects: data crawling, scraping, annotation
    • Thursday: Group work on projects (Evert/Kabashi)

6:00-7:30
    • Monday: Procedures of the course (Evert/Heid); guidelines for student projects (all); student presentations: first ideas on projects
    • Wednesday: Question session on corpus provision and annotation (all)
    • Thursday: Question session on quantitative issues (all)