“Genre discovery” in a document management system

Author: María José San Martín Rodríguez

1 “Genre discovery” in a document management system
  CULT – BCN 2004
  Abaitua, Díaz, Jacob, Quintana [1] and Araolaza [2]
  DELi (Universidad de Deusto) [1], CodeSyntax [2]

2 Contents
  Case study: University of Deusto
  Objectives
  SARE-Bi: a multilingual corpus management system
  Document classification: functions, genres and topics
  Metadata: TEI, TMX, XLIFF
  Future developments

3 Case study: UD
  Official bilingualism (trilingualism for the web)
  Almost 100% of original writing is in Spanish
  Basque: a minority language even in EH (Euskal Herria, the Basque Country)
  Passive bilingualism: many can read/understand Basque, only a few can write it
  Target users and readers?
    departments (e.g. 20 people)
    university staff (1,000 people)
    students (20,000 people)

4 Case study: UD
  Multilingual publishing
    UD generates a high number of administrative documents
    most of them in Spanish and Basque (euskara), some also in English, French, Italian...
  Administrative documents
    large (statutes, regulations, reports...)
    small (calls, announcements, minutes, letters...)
    short messages (“Inquiries in room 422. Sorry for any inconvenience”)

5 Case study: UD
  Translation procedure (inefficient)
    the original document is written in one language
    the writer mails it to the “translators”
    the “translators” produce the other language versions
    the translations are mailed back to the “writer”
    the writer “prints” the multilingual document

6 Objectives
  Implement a more efficient publishing process:
    multilingual publication procedure
    rapid delivery of multilingual documents
  Develop a system for corpus management
    repository + document life cycle
  Design a taxonomy for document classification
    use of metadata (for document classification)

7 Objectives: multilingual publication procedure
  In the chain composition > translation > publication, translating is not enough
    e.g. it requires more functions than those offered by MT: revision, adaptation, versioning, classification, reuse, standardisation
  Users: writers, translators, editors, documentalists, publishers, readers
  Web-centric, work-flow, document sharing
  Other uses: education, translator training, documentalists

8 SARE-Bi (1): a document management system
  Document base
    cumulative document repository
    classified through metadata
  Multilingual functionality
    textual correspondence between documents and segments
  Collaborative system
    users share documents + working space
    work-flow control (X-Flow project, 2002/03)

9 SARE-Bi (2): translation memory
  Experience
    automatic extraction of translation memories from bilingual (es-eu) documents (XTRA-Bi project, )
    several gigabytes of TMX files
    unorganised chunks of text segments
  Multilingual segmented document system
    not only the document as a whole
    if we show the correspondence of multilingual segments, then the system is also a translation memory (TMX) repository
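
The idea that aligned segments amount to a TMX repository can be illustrated with a minimal sketch. It assumes segments are already aligned and held as dictionaries from language code to text; the function name `build_tmx`, the header values and the sample sentences are invented for illustration and are not taken from SARE-Bi.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serialises as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(aligned_segments, source_lang="es"):
    """Build a minimal TMX tree from aligned segments.

    aligned_segments: list of dicts mapping language code -> segment text
    (an assumed in-memory shape, not the SARE-Bi data model).
    """
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",
        "creationtoolversion": "0.1",
        "segtype": "paragraph",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for segment in aligned_segments:
        tu = ET.SubElement(body, "tu")          # one translation unit per aligned segment
        for lang, text in segment.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Invented sample data.
segments = [{"es": "Convocatoria de becas", "eu": "Beka deialdia"}]
tmx_xml = ET.tostring(build_tmx(segments), encoding="unicode")
```

Each aligned paragraph becomes one `tu` element with one `tuv` per language, which is why browsing segments in the document base is equivalent to browsing a translation memory.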

10 SARE-Bi (3): metadata
  Metadata
    document = content + metacontent
    semantic web, ontologies, content syndication...
    XML technology
  TEI (Text Encoding Initiative)
    not so much for the purpose of linguistic mark-up
    rather for structural and cataloguing aspects (TEI header)
  TMX, XLIFF
    for TM exchange and work-flow control

11 SARE-Bi: a first tour
  SARE-Bi
    multilingual document management system
    allows incremental compilation of documents
    allows users to work collaboratively
    uses metadata as a conceptual mechanism
    can also be seen as a memory-based machine translation system
  Demo

12 SARE-Bi: functions
  Retrieving documents
    filtering: based on metadata
    searching: free text, any language

13 SARE-Bi: filter results
  A row for each document
    visualisation link
    modification link

14 SARE-Bi: visualisation
  Export tool
    TEI & TMX
  Complete document
    to retrieve full contents
  Segmented document
    to see language correspondence

15 SARE-Bi: search results
  Found segments
    in all document languages
    equivalent to translation memory browsing
  Includes visualisation link

16 SARE-Bi: adding a document (first step)
  User provides:
    values for metadata
    languages of the document (may be just one)

17 SARE-Bi: adding a document (second step)
  User input
  Metadata management
  Segmentation and alignment
    user can verify that these tasks are OK
  Same page for document modification

18 SARE-Bi: components (general)
  Corpus of multilingual documents
    annotated (TEI-like), segmented, and aligned
    segments are paragraphs
  Metadata associated with each document
    follows the guidelines of the TEI header
    usual data: title, dates, author, place, centre...
  Most important metadata: category, state, visibility
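
Since segments are paragraphs, segmentation and alignment can be sketched in a few lines. The version below assumes paragraphs are separated by blank lines and that the n-th paragraph of each language version corresponds to the n-th paragraph of the others; both are simplifying assumptions, and the function names are hypothetical rather than part of SARE-Bi.

```python
def split_paragraphs(text):
    """Split a text into paragraph segments on blank lines (assumed convention)."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    # Collapse internal line breaks and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def align_by_position(versions):
    """Pair the n-th paragraph of every language version.

    versions: dict mapping language code -> full text.
    Raises ValueError when the paragraph counts differ, so that a user
    can check and correct the segmentation by hand.
    """
    segmented = {lang: split_paragraphs(text) for lang, text in versions.items()}
    counts = {lang: len(paragraphs) for lang, paragraphs in segmented.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"paragraph counts differ: {counts}")
    total = next(iter(counts.values()), 0)
    return [
        {lang: paragraphs[i] for lang, paragraphs in segmented.items()}
        for i in range(total)
    ]


# Invented sample data.
versions = {
    "es": "Aviso.\n\nSe cierra la sala.",
    "en": "Notice.\n\nThe room\nis closed.",
}
aligned = align_by_position(versions)
```

Failing loudly on a mismatch mirrors the manual verification step: positional alignment is only trustworthy when every version has the same paragraph structure.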

19 SARE-Bi: metadata (state and visibility)
  Dynamic behaviour
    users change state/visibility during the editing cycle
    to show the composition/multilingual condition of the document
    all other metadata are static (fixed values)
  State
    non-validated, validated, normative
  Visibility
    rough draft, confidential, shared, public

20 SARE-Bi: components (users)
  Mainly associated with tasks in the system
    guests, writers, translators, administrators
  But also related to permissions
    document owner: the user that added it
  Complex set of permissions
    a rule for each task, which involves:
      the owner metadatum
      the state metadatum
      the visibility metadatum
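
A rule per task that looks at owner, state and visibility could be organised as below. The state, visibility and role names come from the slides; the concrete rules (who may view or modify what) are hypothetical, chosen only to show the shape of such a permission table, and do not describe SARE-Bi's actual policy.

```python
from dataclasses import dataclass

STATES = ("non-validated", "validated", "normative")
VISIBILITIES = ("rough draft", "confidential", "shared", "public")


@dataclass
class Document:
    owner: str
    state: str = "non-validated"
    visibility: str = "rough draft"

    def __post_init__(self):
        if self.state not in STATES or self.visibility not in VISIBILITIES:
            raise ValueError("unknown state or visibility")


def can_view(user, role, doc):
    # Hypothetical rule: owners and administrators see everything,
    # everyone sees public documents, writers/translators see shared ones.
    if role == "administrator" or user == doc.owner:
        return True
    if doc.visibility == "public":
        return True
    return doc.visibility == "shared" and role in ("writer", "translator")


def can_modify(user, role, doc):
    # Hypothetical rule: normative documents are frozen for everyone
    # except administrators.
    if doc.state == "normative":
        return role == "administrator"
    if role == "administrator" or user == doc.owner:
        return True
    return (
        role == "translator"
        and doc.visibility == "shared"
        and doc.state == "non-validated"
    )


# One rule per task, as on the slide.
RULES = {"view": can_view, "modify": can_modify}


def is_allowed(task, user, role, doc):
    return RULES[task](user, role, doc)
```

Keeping the rules in a table keyed by task makes it straightforward to add further tasks (export, validation, deletion) without touching the existing ones.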

21 SARE-Bi: metadata (classification of documents)
  Hierarchical taxonomy of several levels (based on Trosborg 1997)
  1st version of the taxonomy, only:
    genres (45)
    topics (150)
  4th version of the taxonomy:
    communicative functions (3)
    genres (25)
    topics (250)

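22 SARE-Bi: metadata (classification of documents)
  Hierarchical taxonomy at 3 levels
  e.g. a subscription reply card has:
    3  function: inquirir (request)
    11 genre: ficha (record card)
    09 topic: boletín subscripción (subscription form)
  30000/inquirir (request)
    31100/ficha (record card)
      31101/aceptación o renuncia de beca (acceptance or renunciation of a grant)
      31102/boletín de inscripción (registration form)
      31103/datos de viaje (travel details)
      31104/modelo de pago (payment form)
      31105/relación de coordinadores departamentales (list of departmental coordinators)
      31106/planificación actividad de profesores (planning of teaching staff activity)
      31107/prácticas (work placements)
      31108/datos estadísticos (statistical data)
      31109/boletín subscripción revista (journal subscription form)
    31200/impreso (printed form)
      31201/de solicitud de beca (grant application)
      31202/de solicitud de expediente (academic record request)
      31203/de solicitud de admisión (admission application)
      31204/de solicitud de alojamiento (accommodation request)
      31205/de programa Sócrates (Socrates programme)
      31206/de matrícula (enrolment)
      31207/factura (invoice)
      31208/recibí (receipt)
      31209/petición de fotocopias (photocopy request)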
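
The five-digit codes can be read mechanically. The sketch below assumes, from the example above (3 function, 11 genre, 09 topic), that the first digit is the function, the next two the genre and the last two the topic, with "00" meaning that the level is not specified; the names `CategoryCode`, `parse_category` and `path` are hypothetical.

```python
from typing import NamedTuple, Optional


class CategoryCode(NamedTuple):
    function: str
    genre: Optional[str]
    topic: Optional[str]


def parse_category(code):
    """Split a five-digit category code into function, genre and topic.

    Assumed layout (inferred from the slide example): F GG TT.
    """
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"not a five-digit category code: {code!r}")
    genre, topic = code[1:3], code[3:5]
    return CategoryCode(
        function=code[0],
        genre=genre if genre != "00" else None,
        topic=topic if topic != "00" else None,
    )


def path(code):
    """Return the codes from the function level down to the given code."""
    parsed = parse_category(code)
    chain = [parsed.function + "0000"]
    if parsed.genre:
        chain.append(parsed.function + parsed.genre + "00")
    if parsed.topic:
        chain.append(code)
    return chain
```

For the subscription reply card, `path("31109")` walks from 30000 (inquirir) through 31100 (ficha) to 31109 (boletín subscripción revista).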

24 Classification procedures
  Categorisation into “concept” hierarchies (Sebastiani 1999, Bouquet et al. 2003)
    “into topical categories on the basis of content [...] within the general machine learning paradigm”
    “semantic mappings across hierarchical classifications of content”
  Library cataloguing systems: MARC, UDC
    metadata (author, title, series, subject, physical description)
    subjects (e.g. 8 Language, 82 Literature, Translation)
  Text typology (Trosborg 1997): speech acts, communicative functions, genres

25 Classification Hierarchies – CH (Magnini 2003)
  Taxonomic organization of documents
  Easy to build: no formal language is required
  Widely used:
    Web directories (Google, Yahoo!, Looksmart, portals)
    marketplace catalogues for product classification
    file systems
    local ontologies
  Documents are classified at all levels of the hierarchy
  The structure of a CH reflects both the documents and world knowledge

26 CH (Magnini 2003)
  Semi-structured: relations among nodes are not formally defined
  Document dependent: CHs are organized according to the documents that have to be classified
  Specificity criterion: a document is classified in the most specific node of the hierarchy
  Example hierarchy (diagram; node labels only): Vacation, 2001, 2000, Mountains, Sea, Sea, Lake, Tuscany, Spain, USA
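
The specificity criterion can be sketched as a search for the deepest node whose whole path applies to the document. The hierarchy below reuses the slide's labels, but its shape is an assumption (the diagram's edges did not survive), and describing a document as a plain set of labels is a simplification made for the example.

```python
def most_specific_node(hierarchy, doc_labels, path=()):
    """Return the deepest path whose every label is in doc_labels.

    hierarchy: nested dict, label -> sub-hierarchy (an assumed representation).
    doc_labels: set of labels that apply to the document.
    """
    best = path
    for label, children in hierarchy.items():
        if label in doc_labels:
            candidate = most_specific_node(children, doc_labels, path + (label,))
            if len(candidate) > len(best):
                best = candidate
    return best


# Assumed arrangement of the labels shown on the slide.
VACATION = {
    "Vacation": {
        "2001": {"Mountains": {}, "Sea": {"Tuscany": {}, "Spain": {}}},
        "2000": {"Sea": {}, "Lake": {"USA": {}}},
    }
}

photo = {"Vacation", "2001", "Sea", "Spain"}
node = most_specific_node(VACATION, photo)
```

Note that the two "Sea" nodes are distinct: a label is only interpreted relative to the path that leads to it, which is the context dependence the slides attribute to CHs.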

27 CH: e.g. organizing papers on a file system (Magnini 2003)
  Knowledge about the domain is used
  Classification schemas are repeated
  Labels are interpreted in their context
  Example hierarchy (diagram; node labels only): Work, WSD, QA, Experiments, Projects, Papers, Senseval-2, ACL-02, Submission, Camera ready, Submission

28 Interoperability among CHs (Magnini 2003)
  Scientific interest. Various terms have been used recently, including:
    meaning negotiation
    semantic coordination
    mapping between domain models
    semantic mediation
    ontology merging, integration or alignment
    integration of hierarchical categorization
  Fits well in the Semantic Web perspective
  Commercial interest:
    distributed knowledge management in corporations
  Common goal: find mappings between nodes of two classification hierarchies

29 Interoperability among CHs
  Source CH and target CH (diagram; node labels only): Vacation, Sea holidays, 2001, 2000, Mountains, Sea, Sea, Lake, Italy in Europe, Tuscany, Spain, USA

31 Matching Google and Yahoo! (Magnini 2003)
  Results table (Pr = precision, Re = recall), values in extraction order:
    .88 (.93) (.43) .60 (.67) (.69) .78 (.71) (.10) Pr Re. Medicine
    .85 (.96) (.48) .51 (.61) (.62) .71 (.60) (.10) Pr Re. Architecture
    Relations: More specific, More general, Equivalence
  Example:
    Google: Architecture/History/Periods_and_Styles/Gothic
    is more specific than
    Yahoo: Architecture/History/Medieval
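
The three relations (equivalence, more general, more specific) can be illustrated with a toy comparison of two directory paths. This is not the matching algorithm evaluated by Magnini; the linguistic and world knowledge a real matcher needs is replaced here by a small hand-written table, `BROADER`, whose entries are assumptions made for the example.

```python
# Toy world knowledge (assumption): label -> broader label.
BROADER = {
    "gothic": "medieval",
    "periods_and_styles": "history",
}


def generalisations(label):
    """Return the label together with every broader label reachable from it."""
    seen = []
    while label is not None and label not in seen:
        seen.append(label)
        label = BROADER.get(label)
    return set(seen)


def covers(general_path, specific_path):
    """True if each label of general_path equals or generalises a label of specific_path."""
    reachable = set()
    for label in specific_path:
        reachable |= generalisations(label)
    return all(label in reachable for label in general_path)


def relation(path_a, path_b):
    """Relation of node A to node B, each given as a '/'-separated path."""
    a = [part.lower() for part in path_a.split("/")]
    b = [part.lower() for part in path_b.split("/")]
    a_implies_b = covers(b, a)   # everything B says is implied by A
    b_implies_a = covers(a, b)   # everything A says is implied by B
    if a_implies_b and b_implies_a:
        return "equivalent"
    if a_implies_b:
        return "more specific"
    if b_implies_a:
        return "more general"
    return "unrelated"


google = "Architecture/History/Periods_and_Styles/Gothic"
yahoo = "Architecture/History/Medieval"
result = relation(google, yahoo)
```

With the toy table, the Google node comes out as more specific than the Yahoo! node, matching the example on the slide; without the entry that places Gothic under Medieval, the two nodes would be reported as unrelated, which shows how much the mapping depends on knowledge outside the hierarchies themselves.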

32 Experiments
  Web directories: build a reference benchmark for evaluating matching algorithms
    include Looksmart
    Google English vs Google Italian
  File systems
    collaboration: Edamok, SWAP, MEANING
  Domain-specific applications
    medical classification: integration of UML in the algorithm
    public administration: matching document classification hierarchies for automatic routing

33 SARE-Bi: adding a document (document classification: metadata)
  Title
  Languages
  Text category
  Date
  Author
  Place
  Center
  Collection
  Visibility

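34 SARE-Bi: metadata (text categories)
  Hierarchical taxonomy of 3 levels (Trosborg 1997)
    communicative function
    genre
    topic
  30000/inquirir (request)
    31100/ficha (record card)
      31101/aceptación o renuncia de beca (acceptance or renunciation of a grant)
      31102/boletín de inscripción (registration form)
      31103/datos de viaje (travel details)
      31104/modelo de pago (payment form)
      31105/relación de coordinadores departamentales (list of departmental coordinators)
      31106/planificación actividad de profesores (planning of teaching staff activity)
      31107/prácticas (work placements)
      31108/datos estadísticos (statistical data)
      31109/boletín subscripción revista (journal subscription form)
    31200/impreso (printed form)
      31201/de solicitud de beca (grant application)
      31202/de solicitud de expediente (academic record request)
      31203/de solicitud de admisión (admission application)
      31204/de solicitud de alojamiento (accommodation request)
      31205/de programa Sócrates (Socrates programme)
      31206/de matrícula (enrolment)
      31207/factura (invoice)
      31208/recibí (receipt)
      31209/petición de fotocopias (photocopy request)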

35 SARE-Bi: Categories
  Genres
    “reflect differences in external format and situations of use, and are defined on the basis of systematic non-linguistic criteria” (Trosborg 1997)
    “coded and keyed events set within social communicative process” (Todorov 1976, Fowler 1982, Swales 1990)
  UD corpus: 25 genres
  Not effective for rapid interaction

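36 SARE-Bi: Categories
  Genres
    11000/autorización (authorisation)
    11100/acuerdo (agreement)
    11200/instrucciones (instructions)
    11300/normativa (regulations)
    11400/bases (terms and conditions)
    11500/plan (plan)
    11600/ceremonial (ceremonial protocol)
    21100/aviso (notice)
    21200/carta (está firmada) (letter, signed)
    21300/saluda (no se rubrica) (formal greeting note, unsigned)
    21400/certificado (por) (certificate, of)
    21500/convocatoria (call)
    21600/tarjeta de invitación (invitation card)
    21700/folleto (imprenta) (brochure, printed)
    21800/guía (guide)
    21900/memoria (annual report)
    22000/catálogo (catalogue)
    23000/actas (minutes)
    23100/anuncios en prensa (press advertisements)
    23200/carteles de propaganda (promotional posters)
    23700/nombramientos (appointments)
    31100/ficha (record card)
    31200/impreso (printed form)
    31300/cuestionario (questionnaire)
    31400/instancia (formal application)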

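37 SARE-Bi: Categories
  Genres divided into topics
  21400/certificado (por) (certificate, of)
    21401/matrícula de curso (course enrolment)
    21402/asistencia a curso (course attendance)
    21403/participación en curso (course participation)
    21404/plaza en programa (place on a programme)
    21405/admisión en estudios (admission to studies)
    21406/derechos de título pagados (degree fees paid)
    21407/asignaturas de carrera superadas y prueba de conjunto pendiente (degree subjects passed, comprehensive exam pending)
    21408/asignaturas de carrera y prueba de conjunto superadas (degree subjects and comprehensive exam passed)
    21409/superación de pruebas (exams passed)
    21410/suficiencia investigadora (research proficiency)
    21421/oyente en actividad (congreso, jornada, seminario...) (attendee at an activity: conference, workshop, seminar...)
    21422/organizador de actividad (activity organiser)
    21423/ponente en actividad (speaker at an activity)
    21424/evaluador en actividad (reviewer at an activity)
    21425/miembro de comité científico en actividad (scientific committee member at an activity)
    21441/participación en informe (participation in a report)
    21442/participación en proyecto de investigación (participation in a research project)
    21443/financiación para proyecto (project funding)
    21444/participación en comisión (participation in a committee)
    21445/prácticas (work placement)
    21446/solicitud de beca (grant application)
    21447/especialidad-itinerario (specialisation/track)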

38 SARE-Bi: Categories
  Communicative functions
    classification according to the purpose of the discourse (a.k.a. rhetorical strategies)
    does the discourse intend to:
      inform?
      express an attitude?
      persuade?
      create a debate?
  UD documents:
    regulate
    inform
    request (for information)
  Longacre (1976, 1982), Smith (1985) and Biber (1989)

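39 SARE-Bi: Categories
  Genres grouped by functions
  10000/reglamentar (regulate)
    11000/autorización (authorisation)
    11100/acuerdo (agreement)
    11200/instrucciones (instructions)
    11300/normativa (regulations)
    11400/bases (terms and conditions)
    11500/plan (plan)
    11600/ceremonial (ceremonial protocol)
  20000/informar (inform)
    21100/aviso (notice)
    21200/carta (está firmada) (letter, signed)
    21300/saluda (no se rubrica) (formal greeting note, unsigned)
    21400/certificado (por) (certificate, of)
    21500/convocatoria (call)
    21600/tarjeta de invitación (invitation card)
    21700/folleto (imprenta) (brochure, printed)
    21800/guía (guide)
    21900/memoria (annual report)
    22000/catálogo (catalogue)
    23000/actas (minutes)
    23100/anuncios en prensa (press advertisements)
    23200/carteles de propaganda (promotional posters)
    23700/nombramientos (appointments)
  30000/inquirir (request)
    31100/ficha (record card)
    31200/impreso (printed form)
    31300/cuestionario (questionnaire)
    31400/instancia (formal application)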

40 SARE-Bi: adding a document (category selection)
  Menu-driven selection:
    communicative function
    genre
    topic (name)

41 SARE-Bi: implementation
  Web application (based on the Zope server)
    multilingual (es-eu-en localised) web interface
    optimal information/contents management
    complex system of user management
  Object-oriented database
    classes: documents, subdocuments, segments
    attributes: metadata (managed in disjoint sets)
  Full XML functionality
    export into TEI and TMX formats

42 SARE-Bi: conclusions
  In full experimental use since May 2003
  System’s new features (X-Flow, OAC projects)
    work-flow control
    document versioning (XLIFF)
    automatic document categorisation
    discourse segmentation (RST)
    open taxonomy
    ML protocol for metadata harvesting (OAI-PMH)
  On the Internet:
    CodeSyntax

43 SARE-Bi: conclusions
  SARE-Bi has been funded by:
    Autonomous Basque Government
      Dept. of Industry (project X-Flow, )
      Dept. of Education, Universities, and Research (project XML-Bi, PI , )
    CodeSyntax (Eibar, Spain)
  Acknowledgements
    Josu Gómez, Arantza Domínguez (DELi, UD)
    Luistxo Fernández, Eneko Astigarraga, Roberto Quero (CodeSyntax)