1 CitEc to CitEcCyr. A stab at distributed citation systems. Thomas Krichel, RANEPA & Open Library Society. Köln
2 acknowledgement Work described here has been funded by the Russian Academy for the National Economy and State Service, more precisely known as «Российская академия народного хозяйства и государственной службы при Президенте Российской Федерации». I have benefited from some exchange with Min-Yen Kan.
3 talk structure General background. A look at Russian references. [The longer I talk, the less I know what I am talking about.]
4 CitEc A system to do autonomous citation indexing for documents in the RePEc digital library. Founded and operated by José Manuel Barrueco Cruz since 2001. It does: parsing of references from full-text working papers; parsing of reference strings or structured reference data provided by publishers; citation data for RePEc. Public data is exported to other RePEc services.
5 input sources Bibliographic data from RePEc, which may contain reference strings. Some full-text papers that are not freely available, for scanning of references. Stored back copies of RePEc working paper full-text. The importance of full-blown autonomous reference parsing is in steady decline.
6 tools MySQL database; poppler PDF text extractor; ParsCit; in-house Perl scripts
7 sustainability Server sponsored by INOMICS GmbH. No other external funding. Mainly work contributed by José Manuel Barrueco Cruz (JMBC). Not likely to follow the likes of “op cit” and CiteBase.
8 CitEcCyr A funded project that combines the resources of two unfunded projects: CitEc and SocioNet [next slide]. The basic objective is to use the CitEc technology stack to build a citation database for Russian publications. The name is still confusing.
9 two-part system SocioNet, despite its name, has evolved into a cross-disciplinary aggregator with a heavy Russian presence. CitEcCyr will handle the Russian papers and papers in Russian from RePEc that CitEc is handling very badly at this time. We are on the verge of a decentralized system.
10 aim for openness The aim is for software and data to be reusable. Outputs should be reproducible. We also need some form of coordination between nodes. For that we already have some simple protocols.
11 Fraga A simple metadata format that specifies the results of automated citation analysis. Fully implemented for CitEc. Documents not processed by one node can be picked up by another node.
12 Lafka A protocol to gather full-text files and store them in WARCs. Partly implemented software is in operation on RePEc and some SocioNet contents. The CitEc public full-text file will be merged into the RePEc Lafka collection to form a main component of ArchEc … at some stage.
13 DiCit A simple replacement to save us the pains of OAI-PMH. Relies on XML, RelaxNG and rsync. The idea is that rsync is used on files in XML format, the format of which is specified in RelaxNG files. A node can be specified in a single XML file. Fully implemented for a part of the SocioNet data.
14 CitEcCyr Other aspects of CitEcCyr include the usage of citation data to feed annotation services for documents. JMBC's and my job is to build a Russian citation index. For me that means building an index of Russian-language reference strings citing Russian documents. We may do other languages that use Cyrillic letters and that use a similar citation style.
15 state of play (maybe) Scopus and WoS don't do any indexing of documents described in Cyrillic. References to these documents have to be transliterated. There is a tendency to translate and/or transliterate the references. There is also Российский индекс научного цитирования (the Russian Science Citation Index), but it is not an open project either.
16 state of work: ParsCit CitEc uses ParsCit basically out of the box; it just works. This is pretty amazing since it essentially uses a built-in binary crf++ model. That is possible because ParsCit was built with computer science in mind, and the citation style there is similar to the one in economics. Written in Perl but understandable.
17 ParsCit specificity I have studied the section parsing part. Regular expressions that need changing can't be changed without changing the source code. I wrote a library. The complicated structure of the dictionaries makes them hard to extend. Parts of the code are literally duplicated rather than placed into libraries. Parts of the code deal with Omnipage, and that's something we don't have.
18 data source НЭИКОН have a repository. The metadata has about 900k reference strings. However, that data also contains references to non-Russian language documents, mostly written in the language of that document. I suspect that most cases of transliteration or translation occur in non-Russian documents written by Russian authors when they refer to a Russian paper.
19 limitations We need to parse references for authors, titles and years. If a reference does not have all of these, we discard it. Thus we don't look at: laws and other government documents; patents
20 rombas I exclude reference strings that don't contain a Cyrillic char. In order to deal with translations and transliterations, and to exclude non-Russian references, I create the “roman bags”, aka rombas. I partition the references by the number of Roman chars they contain.
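The partitioning into rombas can be sketched roughly as follows. This is a minimal illustration, not the code used in the project. It assumes that "Cyrillic char" means any character in the Unicode block U+0400 to U+04FF and that "Roman char" means an ASCII letter; the function names are hypothetical, and the zero-padded file naming is only inferred from the file names shown on the stats slide.

```python
import re
from collections import defaultdict

# Assumption: the basic Cyrillic Unicode block counts as "Cyrillic".
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
# Assumption: only ASCII letters count as "Roman" characters.
ROMAN = re.compile(r"[A-Za-z]")


def romba_key(reference):
    """Return the number of Roman letters in a reference string,
    or None if the string has no Cyrillic character at all."""
    if not CYRILLIC.search(reference):
        return None
    return len(ROMAN.findall(reference))


def partition_into_rombas(references):
    """Group reference strings by their Roman-letter count.
    Strings without any Cyrillic character are dropped."""
    rombas = defaultdict(list)
    for reference in references:
        key = romba_key(reference)
        if key is not None:
            rombas[key].append(reference)
    return dict(rombas)


def romba_filename(count):
    """Hypothetical file name for a romba, zero-padded to four digits."""
    return f"{count:04d}.txt"
```

Under these assumptions, a reference with no Roman letters lands in romba 0, and a purely Roman reference is excluded before partitioning.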
21 romba stats
0000.txt:
0001.txt: 12676
0002.txt: 10940
0003.txt: 6683
0004.txt: 4085
0005.txt: 2479
0006.txt: 2124
0007.txt: 1632
0008.txt: 1132
0009.txt: 984
22 example 1 Кон Е. Л., Фрейман В. И., Южаков А. А.
23 example 2 Дмитриченко, Э. И.
24 example 3 Ласкин М.Б., Русаков О.В., Джаксумбаева О.И., Ивакина А.А.
25 example 4 Марченков Ю.В., Рябчиков М.М., Шульгин М.А.
26 authors The fact that no first names are used makes it easier to track authors. I got a starter stock of Russian surnames from a web site. Then I scanned the references for further surnames appearing as co-authors. I have a risk of overfitting. I will also use the data from the bibliography.
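The surname harvesting step might look something like the sketch below. It is an illustration under assumptions, not the project's code: it assumes an author is written as a capitalized Cyrillic surname followed by exactly two initials, as in the examples on the preceding slides, and the names `AUTHOR` and `harvest_surnames` are hypothetical.

```python
import re

# Assumed author shape: Surname (optionally hyphenated), an optional comma,
# then two initials, e.g. "Кон Е. Л." or "Ласкин М.Б." or "Дмитриченко, Э. И."
# \u0410-\u042F is А-Я, \u0401 is Ё, \u0430-\u044F is а-я, \u0451 is ё.
UPPER = r"[\u0410-\u042F\u0401]"
LOWER = r"[\u0430-\u044F\u0451]"
AUTHOR = re.compile(
    rf"({UPPER}{LOWER}+(?:-{UPPER}{LOWER}+)?),?\s+{UPPER}\.\s?{UPPER}\."
)


def harvest_surnames(references, known):
    """Return new surnames that co-occur with an already known surname.

    Only references that contain at least one known surname are trusted,
    which limits (but does not remove) false positives from title words.
    """
    known = set(known)
    found = set()
    for reference in references:
        surnames = AUTHOR.findall(reference)
        if any(surname in known for surname in surnames):
            found.update(s for s in surnames if s not in known)
    return found
```

Starting from the single known surname Кон, the author list of example 1 would yield Фрейман and Южаков as new candidates. Repeating the scan with the enlarged stock is what creates the overfitting risk mentioned above.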
27 years Easy to spot as a year number at the end of the string. ParsCit uses a “location” indicator that goes from 1 to 12. I suspect that these are just categories and that crf++ does not actually use numbers.
28 titles Titles almost invariably appear after the list of authors. There is no distinct punctuation mark. Thus the end of the title is difficult to track.
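As a rough sketch of both ideas: the first function looks for a year number near the end of a reference string, and the second maps a token position to one of 12 position categories. The tail window of 30 characters, the accepted year range (1800 to 2099) and the bucketing formula are all assumptions made for illustration. In particular, `location_bucket` is one plausible way to produce a 1-to-12 category and is not a description of how ParsCit computes its indicator.

```python
import re

# Assumption: a "year number" is a standalone 4-digit number from 1800 to 2099.
YEAR = re.compile(r"(?<!\d)(1[89]\d\d|20\d\d)(?!\d)")


def find_year(reference, tail_chars=30):
    """Return the last year-like number found in the tail of the string,
    or None. The tail length is an arbitrary illustrative choice."""
    tail = reference[-tail_chars:]
    matches = YEAR.findall(tail)
    return int(matches[-1]) if matches else None


def location_bucket(index, n_tokens, buckets=12):
    """Map a 0-based token index to a category from 1 to `buckets`.
    The result is meant as a label, not as a quantity to compute with."""
    if n_tokens <= 0 or not 0 <= index < n_tokens:
        raise ValueError("index must lie within the token list")
    return index * buckets // n_tokens + 1
```

If the categories are indeed treated as labels, they could just as well be emitted as strings such as "loc7", which makes explicit that no arithmetic is implied.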
29 tailicity The // or /, when present, will give a good indication of the end of the title. This allows us to parse all references for terms after the //. Look at all references that contain //, look at all the tokens, and evaluate how often they appear after the //. This is the “tailicity” of the token.
30 thus potential features: surnames; initials; // and /; tailicity; location; year potential
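One way to read this definition is as a ratio: among references that contain //, how many of a token's occurrences fall after the first //. The sketch below computes that ratio. Treating tailicity as a fraction between 0 and 1, splitting at the first //, and tokenizing at whitespace are assumptions made for the illustration; the function name is hypothetical.

```python
from collections import Counter


def tailicity(references, marker="//"):
    """Return {token: share of its occurrences that follow the marker},
    counted only over references that contain the marker."""
    after = Counter()
    total = Counter()
    for reference in references:
        if marker not in reference:
            continue  # references without the marker tell us nothing
        head, tail = reference.split(marker, 1)
        head_tokens = head.split()
        tail_tokens = tail.split()
        total.update(head_tokens)
        total.update(tail_tokens)
        after.update(tail_tokens)
    return {token: after[token] / total[token] for token in total}
```

A token with tailicity close to 1, such as a frequent journal name, suggests that the title has already ended when that token appears, even in references that lack the // itself.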
31 token border The approach ParsCit takes is to tokenize at spaces. This will not work when fields are not separated by blanks. I have seen this in a few cases.
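A possible alternative, shown here only as a sketch and not as something ParsCit does, is to tokenize on character classes instead of blanks: runs of letters or digits form one token, the // marker is kept whole, and every other punctuation mark becomes a token of its own. The pattern and the function name are assumptions.

```python
import re

# Order matters: try "//" first, then word runs, then single punctuation marks.
# In Python 3, \w matches Unicode letters, so Cyrillic words stay intact.
TOKEN = re.compile(r"//|\w+|[^\w\s]")


def tokenize(reference):
    """Split a reference string into tokens without relying on blanks."""
    return TOKEN.findall(reference)
```

With this rule, a string such as "Кон Е.Л.,Фрейман В.И." still separates the two authors, whereas splitting at spaces would fuse "Е.Л.,Фрейман" into a single token.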
32 Thank you for your attention! Thomas Krichel http://openlib.org/home/krichel