
Web Corpus Construction

The same is true in corpus design, where no two developers will make the same selection of texts even if they are creating corpora for the same purposes, and every corpus developer will have different views about the kinds of contextual information to include, and the extent to which texts should be interpreted for the corpus user. There are choices at every stage in this process. The same kinds of choices have to be made when developing a corpus for use in ESP or EAP contexts, and in what follows I will discuss these decision-making processes with reference to the three academic corpora I have worked on: BASE, BAWE and ELC.

As an EAP practitioner I had taught academic listening and academic writing for many years, but my knowledge of lecture and seminar discourse was more or less limited to the lectures and seminars I myself had taken part in, my own experiences as a student of English language and literature, and as an audience member for guest lectures and conference presentations, both of which are different from regular academic lectures in terms of their purposes and the speaker-audience relationships.

Similarly my knowledge of student assignments was limited to the applied linguistics assignments I had set for my own students, and the assignments I had once written myself. My situation was, I think, little different from that of most teachers, students, and EAP textbook writers.


Some genres in some disciplines might be accessible to some stakeholders sometimes, but even subject lecturers might not know much about the practices of their colleagues, and, contrary to good pedagogical practice, students are regularly required to produce genres they have never previously encountered. Even the texts and scripts in EAP textbooks often seem to reflect rather idealised notions of how lecturers and university students communicate, giving advice on what the textbook writers think ought to happen, rather than what actually happens on degree courses in the disciplines.

However, when working with occluded genres there are a number of constraints on what material can actually be collected, and to what extent it is possible to control for contextual variables. It would be very difficult to represent all these variables equally well, because a choice of one variable, such as discipline, might greatly limit the chance of finding lecturers of both genders, with greater and lesser degrees of experience, in small and large classes, with and without interactivity.

Yet these considerations are important, because if corpus holdings are going to be compared in terms of a variable such as discipline, the groups of texts in each discipline need to be broadly similar in terms of the other variables. BASE was designed to represent lectures and seminars in equal quantities across four broad disciplinary domains. Other variables were not controlled, although in hindsight the design could have been improved by selecting an equal number of lectures and seminars from each year of study on undergraduate and postgraduate courses.

ELC has been designed to enable comparison of lecturing styles across different countries where English is used as a medium of instruction. For this reason only the lecture genre has been chosen; a decision was made to focus on only one discipline to prevent the project from becoming too complex. Roughly equal numbers of engineering lectures are being collected from different institutions — so far from universities in the UK, New Zealand and Malaysia. The corpus is expanding, so it may be possible to find very good matches across a large number of countries in terms of lecture topic and level.

Cultural similarities and differences are already emerging (see, for example, Alsop et al.). As in needs analysis, the process of text collection is often a journey of discovery. Only large and small lectures are represented, the small lectures being more interactive than the large ones. Within British university degree courses, however, lectures and seminars are usually treated as distinct events; lectures are largely monologic, while in seminars it is the students rather than the lecturers who do most of the talking.

The way seminars were conducted varied widely across the BASE corpus collection contexts, however; sometimes large classes were divided into small discussion groups, sometimes the entire class discussed together, and sometimes students took turns to deliver presentations, informally around a table, or more formally at the front of the class with slides.

ESP corpus construction: a plea for a needs-driven approach

It was difficult to plan in advance to collect a representative sample of each of these formats, because we did not always know in advance of data collection what the format would be. BAWE was developed with the intention of discovering the range of written genres students had to produce for degree programmes in Britain, but we could not plan in advance to collect a representative sample of each.



Instead we designed the corpus to reflect the variables we were sure of from the start. This distribution might be taken to reflect the relative importance of each genre in each discipline, but any assumption of this kind must be very tentative, because we did not sample in a stratified way from the entire population of texts. To represent additional categories in a balanced way would be far more costly, as it would entail collecting many more samples than were needed, and discarding those that were over-represented in any subcategory.
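The cost of balancing can be illustrated with a short sketch. The Python below is not the procedure used for any of the corpora discussed here; the function name `balance_by_category`, the sample records and the genre labels are all invented for illustration. It shows only that balancing a collection down to the size of its smallest subcategory means discarding everything collected beyond that size.

```python
import random
from collections import defaultdict


def balance_by_category(samples, key, seed=0):
    """Downsample so every category keeps as many items as the smallest one.

    Hypothetical helper for illustration only. Returns (kept, discarded).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    if not groups:
        return [], []

    # The smallest category sets the ceiling for all the others.
    target = min(len(items) for items in groups.values())
    rng = random.Random(seed)  # fixed seed so the selection is repeatable

    kept, discarded = [], []
    for category in sorted(groups):
        items = list(groups[category])
        rng.shuffle(items)
        kept.extend(items[:target])
        discarded.extend(items[target:])
    return kept, discarded


# Invented collection: 5 essays, 3 reports, 2 case studies.
collected = (
    [{"genre": "essay", "id": i} for i in range(5)]
    + [{"genre": "report", "id": i} for i in range(3)]
    + [{"genre": "case study", "id": i} for i in range(2)]
)
kept, discarded = balance_by_category(collected, key=lambda s: s["genre"])
print(len(kept), len(discarded))  # 6 kept (2 per genre), 4 discarded
```

In this invented case four of the ten collected texts are thrown away to balance a single variable; balancing several variables at once shrinks the usable cells further still.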

Boundaries between related fields of study are permeable, and within discipline-specific programmes there are often outlying modules, for example on the history of mathematics in a mathematics programme, or on business law for a degree in business. For the BASE corpus we decided to place speech events within domains according to their content rather than the organising department. Thus an ecology lecture delivered in a mathematics department was placed in the domain of Life Sciences, a linguistics lecture was placed in Physical Sciences because of its technical nature, and Philosophy and Typography speech events were placed in more than one domain.

A less interpretative procedure was followed for the BAWE corpus, on the grounds that we intended to capture the student experience of genre production in specific disciplines. Texts were assigned to domains according to the department where the student was enrolled, and departments at different institutions were merged, ignoring slight variations in their names, if they offered degrees in the same disciplinary areas. Apart from the loss of one seminar in the Physical Sciences, we managed to collect as much BASE data as we had planned, spread across the four domains.

The design of BAWE was more ambitious, and while we had originally aimed to collect 32 assignments from each year of study in each of the major disciplines, the final collection was slightly less evenly spread. We also found that we had to distinguish between assignments (the complete pieces of work submitted for assessment) and texts (the generically distinct texts included within assignments); figures for both are provided in Table 2.

Some of this participant information might be confidential, however, and even if participants are willing to provide a great deal of personal information about themselves there may not be time to collect and record it all, or to find useful ways of categorising the information for future reference. This was fairly easy to collect, as all the lecturers had public profiles.

Students would have found it inconvenient to provide us with personal information on an individual basis, as data collection would have had to take place outside class time, so we contented ourselves with broad general information about their level of study and the size of the class. This information was included in the header for each file, together with the recording date, discipline, module name and the title of the speech event.
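As a rough illustration, the header fields listed above could be modelled as a simple record. The sketch below is an assumption throughout: the class name `SpeechEventHeader`, the field names, the field types and the example values are hypothetical, and it makes no claim about how the corpus files actually encode their headers.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class SpeechEventHeader:
    """Hypothetical per-file header holding the fields named in the text."""
    recording_date: date
    discipline: str
    module_name: str
    title: str
    level_of_study: str   # broad category only, e.g. "undergraduate"
    class_size: str       # broad category only, e.g. "small" or "large"


# Invented example values, not taken from any corpus file.
header = SpeechEventHeader(
    recording_date=date(2001, 1, 15),
    discipline="Example Discipline",
    module_name="Example Module",
    title="Example lecture title",
    level_of_study="undergraduate",
    class_size="large",
)
print(asdict(header)["discipline"])
```

Keeping level of study and class size as broad categories mirrors the decision not to collect personal information from individual students.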

Nevertheless a certain amount of language information was collected for the three corpora. BAWE contributors were asked to state their first language and also the number of years of secondary education they had received in the UK. This enabled us to differentiate between L1 English speakers educated in the UK and those educated elsewhere, and between L2 English speakers educated in other countries and those educated in the UK.
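The four-way distinction described above can be sketched as a small classification function. The function name `contributor_group`, the group labels and, in particular, the `min_uk_years` threshold are assumptions made for illustration; the text does not say how many years of UK secondary education were taken to count as "educated in the UK".

```python
def contributor_group(first_language: str,
                      uk_secondary_years: int,
                      min_uk_years: int = 1) -> str:
    """Place a contributor in one of four groups by L1 and schooling.

    Hypothetical helper: min_uk_years is an assumed cut-off, not a
    figure reported for the corpus.
    """
    is_l1_english = first_language.strip().lower() == "english"
    uk_educated = uk_secondary_years >= min_uk_years

    language_label = "L1 English" if is_l1_english else "L2 English"
    schooling_label = "UK-educated" if uk_educated else "educated elsewhere"
    return f"{language_label}, {schooling_label}"


# Invented contributors covering the four possible groups.
examples = [("English", 7), ("English", 0), ("Mandarin", 0), ("Polish", 5)]
groups = [contributor_group(lang, years) for lang, years in examples]
for (lang, years), group in zip(examples, groups):
    print(lang, years, "->", group)
```

A different threshold would move some contributors between groups, so any comparison based on this kind of grouping should state the cut-off used.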


This language and education information can help us to identify distinctive patterns of use which are restricted to contributors with a particular first language. We have received a number of requests for information about the first languages of speakers in the BASE seminars, from researchers who would like to investigate their speech in a similar way. Unfortunately we did not keep a record of the first languages of the BASE seminar participants. In BASE and BAWE there are many more international students at masters level and in applied disciplines such as business, for example, than there are across the corpora as a whole.

ELC is designed to facilitate comparison between lecturing styles in different parts of the world, but we focus on the cultural context rather than the L1 of the participants; it appears that engineering lectures in New Zealand are somewhat different from engineering lectures in the UK, for example, even though all the lecturers and almost all the students share the same L1. However, different corpora, with different basic designs, are needed to explore the implications of these different types of information.

Documentation was needed at the design stage to identify target departments and modules, and we then called upon module leaders to permit us to record their lectures and seminars for BASE, and to help us make contact with students who might contribute their assignments to BAWE. Sources were also useful as an aid to categorising the texts we collected, and interpreting their meaning. These materials were based around hundreds of video clips of lectures and seminars recorded for BASE, but also included excerpts from video interviews with subject lecturers and students, which in turn informed our analysis of the corpus.

In particular, as reported by Nesi, the interviews revealed how departments perceived and differentiated the roles of lectures and seminars.


We could not find much agreement, however, between different contributors submitting the same sort of assignment, in the same discipline, or between what contributors chose as the descriptor for their assignment and their own references to their work within the assignments themselves. One or two genres such as the problem question in law were well established and clearly defined by all participants, but for the most part the nomenclature did not seem to exist within departments to enable lecturers and students to distinguish between all the different types of assignments students were required to write.

The identification process was informed by the objectively observable structural and linguistic features of the texts, and by outside sources (the course documentation and advice from participant interviewees, as discussed in Nesi and Gardner), but it was essentially interpretative, and ultimately imposed on the data our own understanding of why the assignments had been written, and what skills and knowledge they intended to demonstrate.


On the other hand, by annotating our corpora in this way, we have prepared the ground for future users so that they can, if they wish, narrow their investigations to specific genres in BAWE or specific lecture phases in ELC, without needing to repeat our initial laborious identification stages. Markup greatly facilitates the retrieval of texts and parts of texts, and makes it possible to reveal new distribution patterns across the entire corpus, for example by using the bespoke visualisation tool described by Alsop and Nesi. Alternatively users can, of course, just ignore the pragmatic information, and focus solely on the observable objective characteristics of the texts.

It has also discussed some of the issues surrounding the collection of contextual information and its inclusion in corpus resources. I believe that it is worth enhancing these resources, where possible, with the kind of information that the needs analyst arrives at through investigation of the context and interpretation of the textual evidence.

In this way, we can provide our students with a genuinely useful supplement to the giants of the corpus world.


ESP across Cultures 10, 8— In Calzolari, N. May, Reykjavik, Iceland, — MED Magazine, Issue Identifying Speech Acts in E-mails: