Dynamic Data in the humanities Marc Kemps-Snijders
[email protected] EUDAT Dynamic Data Amsterdam September 25th 2014
Dynamic data approach
[Figure: observation time against archive ingest time, with a 45° reference line]
Ideally, data is stored the moment it is observed
Usually, data arrives late
...or never at all
From: EUDAT meeting September 2013 Barcelona
Creating uniformity and standardization for heterogeneous collections • Digitization started around 2000 • Scientists and general public • Provide accurately dated title, author and geographical information
85.957 titles 92.276 authors 157.432 dependent titles
Over 84 M unique articles from 1618 to 1996 20 B words
Over 2 M pages in 10.000 books from the period 1781 to 1800
Data behaviour in a humanities Virtual Research Environment
[Figure: time against archive ingest time, showing Author 1587-1679 and Book 1623]
Data arrives VERY late
Records are often related, e.g. books and authors
Data needs to be curated……
Sept 2013
SAME record
Phenomena are recorded in a single record (metadata)
Metadata curation example 1. Sometimes authors appear twice in our system, e.g. due to spelling variants or name variants. In the 16th century authors sometimes published under their motto rather than their own name.
Example:
• "Liefde verwinnet al" (Love conquers all)
• "door Eén is 't nu voldaen" (by One it is all done now)
Joost van den Vondel, 17 November 1587 to 5 February 1679 (aged 91)
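One way to picture this curation step is an alias table that maps every spelling variant or motto onto a single canonical author record. The following is a minimal sketch, not the Nederlab implementation: the identifier `vondel`, the `normalise` helper and the record layout are assumptions made for illustration; only the names, mottos and dates come from the slide.

```python
import unicodedata

def normalise(name: str) -> str:
    """Lower-case, strip accents and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())

# Hypothetical canonical record; the identifier "vondel" is made up for this sketch.
AUTHORS = {
    "vondel": {"name": "Joost van den Vondel", "born": "1587-11-17", "died": "1679-02-05"},
}

# Every known variant (spelling, abbreviation, motto) points at one canonical identifier.
ALIASES = {
    normalise("Joost van den Vondel"): "vondel",
    normalise("J. van den Vondel"): "vondel",
    normalise("Liefde verwinnet al"): "vondel",
    normalise("door Eén is 't nu voldaen"): "vondel",
}

def resolve(name: str):
    """Return the canonical author record for a name variant, or None if unknown."""
    author_id = ALIASES.get(normalise(name))
    return AUTHORS.get(author_id) if author_id else None
```

With such a table, a title published under the motto "Liefde verwinnet al" resolves to the same record as one published under the author's own name, so the author no longer appears twice.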
Versioning and reproducibility
[Figure: time against archive ingest time for Joost van den Vondel (1587-1679), with a 45° reference line. Records: Het lof der zeevaert (poem, 1623); Lucifer (drama, 1654). Labels: Oct 2013; AIT: Nov 2013, Exp: -; Jan 2014; AIT: Oct 2013, Exp: Jan 2014; AIT: Jan 2013, Exp:]
Query 1: How many titles are available for Vondel? Answer: one
Query 1, run again later: How many titles are available for Vondel? Answer: two
Query 1 is not reproducible
• Add Archive Ingest time stamp
• Add expiration time stamp
Reproducibility prevents objects from being thrown away
Select title where ArchiveIngestTime(title) < ArchiveIngestTime(query) and ExpirationTime(title) > ArchiveIngestTime(query)
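The selection rule above can be sketched in Python as a filter over time-stamped records. This is a minimal illustration under stated assumptions: the `TitleVersion` class, the `titles_as_of` function and all concrete dates are hypothetical and only loosely follow the slide; a missing expiration time is treated as "never expires".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class TitleVersion:
    title: str
    author: str
    ingested: date                  # archive ingest time (AIT)
    expires: Optional[date] = None  # expiration time; None means still current

def titles_as_of(records, author, query_time):
    """Titles visible to a query stamped with query_time.

    Mirrors the condition on the slide: AIT(title) < AIT(query) and
    Exp(title) > AIT(query), treating a missing expiration as "never".
    """
    return [
        r.title
        for r in records
        if r.author == author
        and r.ingested < query_time
        and (r.expires is None or r.expires > query_time)
    ]

# Illustrative dates only (assumptions, not taken from the archive).
RECORDS = [
    TitleVersion("Het lof der zeevaert", "Joost van den Vondel", date(2013, 10, 1)),
    TitleVersion("Lucifer", "Joost van den Vondel", date(2014, 1, 1)),
]

QUERY_1 = date(2013, 11, 1)  # time stamp stored together with the original query

first_run = titles_as_of(RECORDS, "Joost van den Vondel", QUERY_1)
# Re-running with the stored time stamp gives the same answer, whatever was ingested since.
rerun = titles_as_of(RECORDS, "Joost van den Vondel", QUERY_1)
```

Because the query carries its own time stamp, re-running it after "Lucifer" has been ingested still returns one title, while a new query stamped later sees both. The price is that expired records have to be kept, which is why reproducibility prevents objects from being thrown away.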
Data curation example 2. Editions provide an additional challenge
• Recently published
• Consists of fragments of modern and old Dutch
Editions are to be split up into source texts and editorial paratexts
[Figure labels: Published 1613; Published 1623; J. van den Vondel, Twee zeevaart gedichten; Marijke Spies; Hymnus….. Joost van den Vondel; Lof der zee-vaert; Published 1987]
Data curation example 3.
OCR digitized newspaper articles sometimes prove to be of poor quality, e.g.
• Older articles
• WW II articles
Crowd sourcing projects are underway to provide accurate transcriptions
Collaboration with Royal Library
[Figure: raw OCR output and corrected transcription of a newspaper article published June 14th 1618, opening "VVt Venetien den 1. Iunij, Anno 1618."]
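OCR quality of this kind is often quantified as a character error rate: the edit distance between the OCR output and a corrected transcription, divided by the length of the transcription. The sketch below is an assumption about how such a measure could be computed, not a description of the crowd sourcing project's tooling; the function names are hypothetical, and the word pair in the usage line is presumed to be the OCR reading and the corrected reading of the same word in the 1618 sample.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(ocr: str, reference: str) -> float:
    """Edit distance between OCR text and reference, relative to reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ocr, reference) / len(reference)

# Presumed OCR reading versus corrected reading of one word from the sample.
rate = character_error_rate("ghevanckellicn", "ghevanckelijck")
```

A measure like this could help decide which articles (e.g. older or WW II material) most need manual transcription.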
Annotations
Linguistic annotations are at the heart of scientific data processing, e.g. Part of Speech tagging, Named Entity Recognition, Syntactic analysis, Coreference, Semantic Role Labeling.
Example sentence: Ga er nog eens op uit in Amsterdam! (Go out and about in Amsterdam once more!)
1  Ga         gaan       [ga]           WW(pv,tgw,ev)                        0.993151  0  ROOT
2  er         er         [er]           VNW(aanw,adv-pron,stan,red,3,getal)  0.972222  1  mod
3  nog_eens   nog_eens   [nog]_[eens]   BW()_BW()                            0.980727  1  mod
4  op         op         [op]           VZ(fin)                              0.920000  1  pc
5  uit        uit        [uit]          VZ(fin)                              0.936170  4  hdf
6  in         in         [in]           VZ(init)                             0.998321  1  mod
7  Amsterdam  Amsterdam  [Amsterdam]    SPEC(deeleigen)                      1.000000  0  ROOT
8  !          !          [!]            LET()                                0.995005
[Figure: Frog and Alpino annotations compared. Labels: Word="Ga"; Lemma="ga"; Lemma="uit_gaan"; Lemma="Amsterdam"; Postag=SPEC(deeleigen); Postag=N(eigen,ev,basis,onz,stan)]
Most tools need to be trained or are designed to deal with specific language periods (commonly modern language). The result often needs to be manually corrected. Interoperability across tools is often an issue (tagsets and processing methods).
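Since the result often needs manual correction, one practical step is to read the columnar tagger output into records and flag low-confidence tokens for review. The sketch below parses the rows quoted on the annotation slide. The column meanings (index, word, lemma, morphology, PoS tag, confidence, head, relation) are an assumption based on how the rows look, and the 0.95 threshold is arbitrary.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    index: int
    word: str
    lemma: str
    morphology: str
    pos_tag: str
    confidence: float
    head: Optional[int]      # index of the governing token, if present
    relation: Optional[str]  # dependency relation, if present

def parse_row(row: str) -> Token:
    """Parse one whitespace-separated annotation row; trailing columns may be missing."""
    fields = row.split()
    index, word, lemma, morphology, pos_tag, confidence = fields[:6]
    head = int(fields[6]) if len(fields) > 6 else None
    relation = fields[7] if len(fields) > 7 else None
    return Token(int(index), word, lemma, morphology, pos_tag,
                 float(confidence), head, relation)

ROWS = """\
1 Ga gaan [ga] WW(pv,tgw,ev) 0.993151 0 ROOT
2 er er [er] VNW(aanw,adv-pron,stan,red,3,getal) 0.972222 1 mod
3 nog_eens nog_eens [nog]_[eens] BW()_BW() 0.980727 1 mod
4 op op [op] VZ(fin) 0.920000 1 pc
5 uit uit [uit] VZ(fin) 0.936170 4 hdf
6 in in [in] VZ(init) 0.998321 1 mod
7 Amsterdam Amsterdam [Amsterdam] SPEC(deeleigen) 1.000000 0 ROOT
8 ! ! [!] LET() 0.995005
"""

tokens = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Tokens whose tag confidence falls below the (arbitrary) review threshold.
REVIEW_THRESHOLD = 0.95
needs_review = [t.word for t in tokens if t.confidence < REVIEW_THRESHOLD]
```

Keeping the tool's own tag strings untouched in the record, rather than mapping them to another tagset on import, leaves room to compare output from different tools later.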
Annotations
Ideally produce
• Training corpora (manually corrected)
• Preprocessed annotated data (sometimes using different tools)
• (Manually) corrected annotated data
[Figure labels: Book; Based on; Training corpus; Manually corrected; e.g. from the same time period; Used training corpus; Processed resource; Annotation]
Nederlab Virtual Research Environment
With over 37.5 M documents and 1.277.188.758 words currently available in the environment, this becomes quite a difficult process to manage. We have ongoing discussions on acceptable methods for maintaining this environment over prolonged periods of time:
• How to handle dynamic behaviour of data?
• Under which conditions can data be phased out?
• Should ALL data be integrated into the environment?
At least for metadata management, a separate editorial environment has been set up to limit the number of potential updates (and versions) in the system.
Nederlab Virtual Research Environment
[Figure labels: Harmonization tool; Metadata editor; Over 2 M pages in 10.000 books from the period 1781 to 1800; VRE]
Concluding remarks
Efficient versioning appears to be the key to dynamic data management
• Maintain version history
• Assign appropriate time stamps
• When dealing with large quantities of data, decide upon criteria for phasing out data
• When dealing with heterogeneous collections from different sources, including automated enrichment processes, great care must be taken to maintain overall data integrity
  – Both data and metadata may be affected
  – Must be evaluated on a case by case basis
  – In our domain, data dynamics is not limited to a single project or organization! Data may originate from different overlapping sources and different approaches may have been applied (e.g. data enrichment processes)
Thank you for your attention