DAG voor STATISTIEK en BESLISKUNDE 2011
(Day of Statistics and Operations Research)

Thursday 3 March 2011

Jaarbeurs (Beatrixgebouw), Utrecht

Keynote speakers

Jerome Friedman

& Maurizio Vichi

VVS

Vereniging voor Statistiek en Operationele Research (Netherlands Society for Statistics and Operations Research), Postbus 244, 6700 AE Wageningen, telephone 0317-419572, e-mail: [email protected]. See our website www.vvs-or.nl for how to become a member of the VVS or subscribe to one of the VVS periodicals.

General Members' Meeting

The meeting takes place at the end of the morning programme. The meeting documents can be downloaded from the VVS-OR website from two weeks before the meeting, or requested from the secretary of the society.

Admission

The lectures are open to members of the society, speakers and other invited guests. Interested non-members can gain admission by registering as a member in advance on the website www.vvs-or.nl.

Location

This year the Dag voor Statistiek en Besliskunde takes place in the conference complex of the Jaarbeurs (Beatrixgebouw) in Utrecht, Jaarbeursplein 6, 3521 AL Utrecht. The morning programme is held in room 417 (plenary room); the afternoon sessions are in rooms 415, 416 and 417. All rooms are on the 4th floor of the Beatrixgebouw. See www.jaarbeursutrecht.nl for directions or consult the floor plan at the back.

In the corridors

The Dag voor Statistiek en Besliskunde of the VVS-OR is the meeting place for people who care about statistics and operations research or who work in these fields; members and non-members meet one another in the corridors. You can also visit the stands of several exhibitors.

Registration

Because of the catering, it is essential that all visitors register in advance using the registration form on the website www.vvs-or.nl.

The confirmation of registration serves as proof of admission. Registration on the website is possible up to and including Tuesday 22 February 2011.

Visitors who have not registered, or have not registered in time, will be asked to contribute 30 euros towards the costs!

Organising committee

Cees Diks, Jacqueline Meulman, Gerrit Stemerdink and the section coordinators.

Information

Cees Diks, e-mail ,

Lunch and drinks

Lunch can be taken at your own expense at the venue or elsewhere. The closing drinks are offered by the society.

Language

The general parts of the programme are in Dutch. Most lectures, however, are in English.

The Dag voor Statistiek en Besliskunde 2011 is made possible in part by Cosinus Computing BV, Drunen.

PROGRAMME

09.30 – 10.00 Coffee and tea

10.00 – 10.05 Opening of the Dag voor Statistiek en Besliskunde 2011 (room 417)

10.05 – 11.00 Lecture by Maurizio Vichi

11.00 – 11.10 Break

11.10 – 11.30 Presentation of the VVS thesis prize (VVS-Scriptieprijs), with a lecture by the prize winner

11.30 – 12.30 General Members' Meeting

12.30 – 13.30 Lunch

13.30 – 15.30 Parallel sessions. The parallel sessions are organised by the sections of the VVS-OR. There is also a special session organised by Statistics Netherlands (Centraal Bureau voor de Statistiek).

13.30 – 14.30 Economics Section (room 415), Netherlands Society for Operations Research (Nederlands Genootschap voor Besliskunde, room 416) and Social Sciences Section (room 417)

14.30 – 15.30 Statistics Netherlands (room 415), Mathematical Statistics Section (room 416) and Biometrics Section (room 417)

15.30 – 15.45 Break with coffee and tea

15.45 – 16.45 Lecture by Jerome Friedman (room 417)

16.45 – 17.30 Drinks

Plenary lectures 2011

Maurizio Vichi

La Sapienza University of Rome

Research patterns, trends and new challenges in cluster analysis

Methodologies for cluster analysis are among the most well-known and appreciated statistical techniques of multivariate analysis. In the last twenty years they have been increasingly applied in new disciplines and frequently almost reinvented in many areas of research such as computer science, engineering and bioinformatics, and in specific fields including machine learning, data mining and pattern recognition. In this presentation we show new statistical research patterns and trends in methodologies for clustering. The illustrated methods have in common the statistical approach of formulating a mathematical model, estimating its parameters and finally fitting the model to data.

The presentation is divided into three parts: single clustering of a set of units, multi-partitioning of a set of units and a set of variables, and clustering of longitudinal multivariate observations.

Model-Based partitioning and hierarchical clustering

The cluster analysis problem of partitioning or hierarchically clustering a set of units from dissimilarity data is handled here with the statistical model-based approach of fitting the ‘closest’ classification matrix to the observed dissimilarities. A classification matrix represents a clustering model expressed in terms of dissimilarities.

Three models for partitioning a set of units from dissimilarity data are illustrated, and their estimation via least squares is given together with new fast coordinate descent algorithms. Following the same statistical fitting approach, a new model for hierarchical clustering from dissimilarity data is also illustrated.
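To make the idea of fitting a classification matrix by least squares concrete, here is a minimal sketch. It does not reproduce any of the three models of the lecture. It assumes a simple two-level classification matrix (one dissimilarity level for pairs within a cluster, one for pairs in different clusters), and the function names `ls_loss` and `fit_partition` are hypothetical. Units are reassigned one at a time, in the spirit of coordinate descent.

```python
import numpy as np

def ls_loss(D, labels):
    """Least-squares loss of the best two-level classification matrix
    (assumed model: one level within clusters, one level between clusters)."""
    n = len(labels)
    upper = np.triu_indices(n, k=1)                 # each unordered pair once
    same = (labels[:, None] == labels[None, :])[upper]
    d = D[upper]
    loss = 0.0
    for mask in (same, ~same):
        if mask.any():
            # for a fixed partition, the LS-optimal level is the mean of its pairs
            loss += float(((d[mask] - d[mask].mean()) ** 2).sum())
    return loss

def fit_partition(D, n_clusters, init, max_sweeps=20):
    """Move one unit at a time to the cluster that lowers the loss most."""
    labels = np.asarray(init).copy()
    best = ls_loss(D, labels)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(labels)):
            keep = labels[i]
            for g in range(n_clusters):
                if g == keep:
                    continue
                labels[i] = g                       # try unit i in cluster g
                loss = ls_loss(D, labels)
                if loss < best - 1e-12:
                    best, keep, changed = loss, g, True
            labels[i] = keep                        # keep the best assignment found
        if not changed:                             # no unit moved: local optimum
            break
    return labels, best

# Toy dissimilarities: two groups of three units, 1 within and 5 between groups.
D = np.full((6, 6), 5.0)
D[:3, :3] = 1.0
D[3:, 3:] = 1.0
np.fill_diagonal(D, 0.0)
labels, loss = fit_partition(D, n_clusters=2, init=[0, 0, 0, 0, 1, 1])
```

Like any descent of this kind, the sketch only guarantees a local optimum, so in practice several starting partitions would be tried.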

Bi-partitioning, multi-partitioning, clustering and disjoint principal component analysis

New methodologies for two-mode (units and variables) multi-partitioning of two-way data are presented. In particular, by reanalyzing double k-means, which identifies a unique partition for each mode of the data, a relevant extension is discussed which allows more than one partition of one mode to be specified, conditional on the partition of the other mode. The performance of this generalized double k-means has been tested both in a simulation study and in an application to gene microarray data. Clustering and disjoint principal component analysis identifies a partition of the units and a partition of the variables, together with a principal component for each class of the partition of the variables. This technique can be seen as a special case of bi-partitioning.
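The basic double k-means idea mentioned for two-mode partitioning can be sketched as an alternating least-squares loop. This is a minimal sketch of plain double k-means only, not of the generalized version discussed in the lecture. It assumes the data matrix is approximated by block means, and the helper names `block_means` and `double_kmeans` are hypothetical.

```python
import numpy as np

def block_means(X, rows, cols, K, Q):
    """Mean of X within each (row cluster, column cluster) block."""
    C = np.zeros((K, Q))
    for k in range(K):
        for q in range(Q):
            block = X[np.ix_(rows == k, cols == q)]
            if block.size:                      # empty blocks keep mean 0
                C[k, q] = block.mean()
    return C

def double_kmeans(X, K, Q, rows, cols, n_iter=50):
    """Alternate row and column reassignment until both partitions are stable."""
    rows = np.asarray(rows).copy()
    cols = np.asarray(cols).copy()
    for _ in range(n_iter):
        C = block_means(X, rows, cols, K, Q)
        # cost of putting row i in row cluster k: sum_j (x_ij - C[k, cols_j])^2
        row_cost = ((X[:, None, :] - C[:, cols][None, :, :]) ** 2).sum(axis=2)
        new_rows = row_cost.argmin(axis=1)
        C = block_means(X, new_rows, cols, K, Q)
        # cost of putting column j in column cluster q: sum_i (x_ij - C[rows_i, q])^2
        col_cost = ((X[:, :, None] - C[new_rows][:, None, :]) ** 2).sum(axis=0)
        new_cols = col_cost.argmin(axis=1)
        stable = np.array_equal(new_rows, rows) and np.array_equal(new_cols, cols)
        rows, cols = new_rows, new_cols
        if stable:
            break
    return rows, cols, block_means(X, rows, cols, K, Q)

# Toy data: 5 units x 5 variables with a clean 2 x 2 block structure.
X = np.array([[0, 0, 10, 10, 10]] * 3 + [[20, 20, 30, 30, 30]] * 2, dtype=float)
rows, cols, C = double_kmeans(X, K=2, Q=2, rows=[0, 0, 1, 1, 1], cols=[0, 1, 1, 1, 1])
```

Starting from partitions with one misplaced unit and one misplaced variable, the loop recovers the block structure of this toy matrix. On real data the result depends on the starting partitions.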

Clustering longitudinal multivariate observations

Longitudinal multivariate data involve repeated observations of different features of the same statistical units over a period of time. The aim is to study the developmental trends of the units across at least a part of their life span. In this presentation, the dynamic evolution of the partitions of units over time is studied in an unsupervised clustering context using a model-based clustering approach. A clustering model and a vector autoregression VAR(P) model, where P is the lag length of the VAR, are combined into a new technique that identifies a homogeneous partition in G classes for each time t and the autoregressive dynamic evolution of the clusters. The proposed clustering/VAR model can also be used to forecast a partition at time T+1. The parameters of the model are estimated in both a least-squares and a maximum likelihood framework, and efficient recursive algorithms are given. A simulation study and some applications of the proposed methodologies are shown to assess the performance of the models and the quality of their estimates.

In the final part of the presentation, similarities between trajectories describing histories of units are studied. Trend, velocity and acceleration are three characteristics of trajectories considered to assess pairwise dissimilarities between trajectories. The Tucker model for three-way data, modified for clustering units together with a dimensional reduction of the observed variables, is estimated in the metric space specified by trend, velocity and acceleration. An application is given to show the performance of the methodology.

References

Martella F., Alfò M., Vichi M. (2010). Hierarchical mixture models for biclustering in microarray. Statistical Modelling. To appear.

Rocci R., Vichi M. (2008). Two-mode Multi-partitioning. Computational Statistics & Data Analysis, vol. 52, pp. 1984-2003. ISSN: 0167-9473.

Vicari D., Vichi M. (2011). On Multivariate Linear Regression for Heterogeneous Data. Submitted.

Vichi M., Saporta G. (2009). Clustering and Disjoint Principal Component Analysis. Computational Statistics & Data Analysis, vol. 53, pp. 3194-3208. ISSN: 0167-9473, doi: 10.1016/j.csda.2008.05.028.

Vichi M. (2008). Fitting Semiparametric Clustering Models to Dissimilarity Data. Advances in Data Analysis and Classification, vol. 2, 2, pp. 121-161.

Vichi M. (2011). Fitting Hierarchical Clustering Models to Dissimilarity Data. Submitted.

Maurizio Vichi (Rome, 1959) is professor of Statistics at Sapienza University of Rome, Department of Statistics. He is president of the Italian Statistical Society, and editor of the Springer journal Advances in Data Analysis and Classification, of the international Springer series Classification, Data Analysis and Knowledge Organization, and of the new Springer series in statistics, Studies in Theoretical and Applied Statistics, founded together with the French, Spanish and Portuguese statistical societies. He was associate professor of statistics at the University G. D’Annunzio of Chieti from 1992 to 1999 and researcher at Sapienza University of Rome from 1990 to 1991. He was research fellow at Rutgers University (USA, 1986) and St. Andrews University (UK, 1984), and visiting professor at CNAM (Paris, 2006) and Université Paris Dauphine (Paris, 2010). He was editor of the Springer journal Statistical Methods and Applications from 2000 to 2006, president of Cladag from 1997 to 2007 and Secretary General of the Italian Statistical Society from 1998 to 2002. His statistical interests are in the areas of multivariate statistics, data analysis and three-way data, as well as model-based classification and clustering and mixture models.

Programmes of the sections

For abstracts, see the ABSTRACTS section below.

Economics Section

13.30-14.00 Adam Booij, Top Institute for Evidence Based Education Research (TIER), UvA

Overconfidence and student choice: an intake experiment at FEB-UvA

14.00-14.30 Lex Borghans, Maastricht University and IZA, and Trudie Schils, Maastricht University

The Leaning Tower of PISA: the effect of test motivation on scores in the international student assessment

Netherlands Society for Operations Research (Nederlands Genootschap voor Besliskunde)

13.30-14.00 Jacob Jan Paulus, Consultant, CQM

The importance of modelling skills and implementations in OR education

Social Sciences Section

Theme: The knowledge level of incoming students

University education finds itself in something of a split. On the one hand, prior education changes regularly, so that the entry level has to be adjusted downwards again and again. Lecturers cannot simply assume that an incoming student has the same level as a student of ten or twenty years ago. On the other hand, the university has to be competitive at an international level, so that the exit level has to be adjusted upwards again and again. Very high demands are often placed on recent PhD graduates, such as having published extensively, having teaching experience and having won prizes. Against this background, the Social Sciences Section has invited two speakers to talk about the knowledge level of incoming students, nationally and internationally. The focus is on mathematics and statistics in secondary education, the most important foundation for statistics education at university.

13.30-14.00 Peter Kop, ICLON, Universiteit Leiden

Statistics in secondary education, now and in the future (after 2014)

14.00-14.30 Robert Zwitser, Cito and Universiteit van Amsterdam

An international comparison of mathematics knowledge in secondary education using PISA

Statistics Netherlands (Centraal Bureau voor de Statistiek)

14.30-15.00 Martijn Tennekes, Piet Daas & Edwin de Jonge, Centraal Bureau voor de Statistiek, Divisie Methodologie en Kwaliteit

Visual profiling of large statistical datasets

15.00-15.30 Jan van der Laan, Centraal Bureau voor de Statistiek, Divisie Methodologie en Kwaliteit

Imputation of rounded data

Mathematical Statistics Section

Theme: Estimation under monotony restrictions

14.30-15.00 Piet Groeneboom

Monotone hazards and the bootstrap

15.00-15.30 Rik Lopuhaä

The limit distribution of the L∞-error of isotonic estimators

Biometrics Section

14.30-15.00 Jacob van Eeghen, Stedelijk Gymnasium Leiden

Probability and statistics at vwo (pre-university secondary education)

15.00-15.30 Theo Stijnen, Leids Universitair Medisch Centrum

The master's programme Statistical Science

Jerome H. Friedman

Stanford University

Jerome H. Friedman is Professor Emeritus of Statistics, Stanford University. He received both bachelor's and Ph.D. degrees in physics from the University of California, Berkeley. He was leader of the Computation Research Group at the Stanford Linear Accelerator Center from 1972 through 2006. He was Professor of Statistics, Stanford University, from 1982 through 2006, and served as Department Chair from 1988 through 1991. His primary interests center on machine learning and data mining. He has authored or coauthored over 100 papers in major statistical journals as well as three books on data mining, and has invented or co-invented several widely used data mining procedures. He has been awarded several honors, including member of the National Academy of Sciences, Fellow of the American Academy of Arts and Sciences, Fellow of the American Statistical Association, American Statistical Association Statistician of the Year (1999), the Association for Computing Machinery Data Mining Lifetime Innovation Award, the Emanuel and Carol Parzen Prize for Statistical Innovation, the Institute of Mathematical Statistics Rietz and Wald Lectures, American Statistical Association Noether Senior Lecturer, and paper of the year in JASA (1980 and 1985) and Technometrics (1988 and 1992).

Statistical Learning with Large Numbers of Predictor Variables

Many present-day applications of statistical learning involve large numbers of predictor variables. Often that number is much larger than the number of cases or observations available to train the learning algorithm. In such situations traditional methods fail. Recently, new techniques based on regularization have been developed that can often produce accurate learning models in these settings. This talk will describe the basic principles underlying the method of regularization and then focus on those methods exploiting the sparsity of the predicting model. The potential merits of these methods are then explored by example.
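Jerome Friedman's abstract describes regularization methods that exploit sparsity when predictors outnumber observations. One well-known method of this kind is the lasso, which adds an L1 penalty to least squares. The sketch below is an illustration chosen here, not necessarily the method presented in the lecture. It assumes the objective (1/(2n))·||y − Xb||² + lam·||b||₁ and minimises it by cyclic coordinate descent with soft-thresholding. The names `soft_threshold` and `lasso_cd` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column curvature
    r = np.asarray(y, dtype=float).copy()  # residual y - Xb (b starts at zero)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]            # remove predictor j from the fit
            rho = X[:, j] @ r / n          # its correlation with the partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]            # put the updated predictor back
    return b

# Orthogonal toy design: each coefficient is the soft-thresholded LS estimate.
X_orth = np.sqrt(3.0) * np.eye(3)
y_orth = np.sqrt(3.0) * np.array([2.0, 0.1, -1.0])
b_orth = lasso_cd(X_orth, y_orth, lam=0.5)

# More predictors (20) than observations (5): a large penalty gives an empty model.
rng = np.random.default_rng(0)
X_wide = rng.standard_normal((5, 20))
y_wide = rng.standard_normal(5)
lam_max = np.abs(X_wide.T @ y_wide).max() / 5
b_wide = lasso_cd(X_wide, y_wide, lam=lam_max)
```

Lowering `lam` below `lam_max` lets predictors enter the model one by one, which is how the penalty controls sparsity.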

ABSTRACTS

Adam Booij, Top Institute for Evidence Based Education Research (TIER), UvA

Overconfidence and student choice: an intake experiment at FEB-UvA

It is well known that many people are overconfident and that this can lead to inefficient decisions. In the context of Higher Education (HE) this could mean that if young people overestimate their probability of graduating, too many will enroll in university and subsequently drop out. Indeed, many Higher Education institutions that have non-selective admission, including our faculty, face the problem that a great number of students switch to another field (or drop out) in the first year. This is inefficient for both the student and the institution.

To examine whether information provision helps to reduce this inefficiency in the context of schooling decisions, we conducted a randomized experiment. A quarter of the applicants to our undergraduate program in economics and business were given objective information about past graduation rates of students with their characteristics (same gender, grade point average and mathematics grade in high school). Another quarter were invited for a personal interview where success rates and motivation were discussed at a general level. The third quarter went through a realistic course day in which the graduation rates of students with similar characteristics were derived and made very explicit. Finally, there was a control group of students who were enrolled directly without additional treatment. Preliminary results suggest that only a realistic experience prior to enrollment affects students' enrollment choices.

Lex Borghans, Maastricht University and IZA, and Trudie Schils, Maastricht University

The Leaning Tower of PISA: the effect of test motivation on scores in the international student assessment

International assessments like PISA are widely used as an instrument to compare the output of educational systems. An important question is to what extent PISA really picks up the knowledge and skills of students and to what extent students' motivation to perform well on this test affects the results. This paper (1) investigates student and test characteristics that explain test motivation and (2) investigates to what extent PISA scores as published in the league tables are correlated with the average country-specific test motivation. We utilize the random variation in the order of the questions in the 13 different booklets of the test to identify the change in performance during the test (PISA 2003-2006).

A robust finding of the analyses is that test performance decreases as the test progresses. The test performance drop is approximately linear, and estimates show substantial variation between and within countries. Moreover, the test performance drop increases with ability, is higher among girls than among boys, and decreases when students devote more time to studying in general. The test performance drop relates to some of the big five personality traits, mainly to agreeableness, and it predicts outcomes in later life such as income and smoking in addition to the pure test score. This suggests the existence of a non-cognitive effect on test performance, apart from the cognitive part. Finally, the correlation between the test performance drop and a country's overall test scores is very high, and the non-cognitive effect can explain about 50 percent of the total variation in these test scores.

Jacob van Eeghen, Stedelijk Gymnasium Leiden

Probability and statistics at vwo (pre-university secondary education)

The talk gives an overview of what is taught in probability and statistics at vwo. A distinction is made between the current curricula for the different types of mathematics and those of the recent past. Final examination questions give an impression of the level at which pupils are expected to master the various topics.

Piet Groeneboom

Monotone hazards and the bootstrap

About forty years ago, at the start of my career, a well-known statistician told me that isotonic regression was a dead subject. Twenty years ago, another well-known statistician told me that the bootstrap was dead. Around the same time Apple computer was declared dead by the Microsoft-following community. So, somewhat appropriately, I recently used my Apple computer to resurrect isotonic regression and the bootstrap from their graves to perform a danse macabre. This after a visit to an Oxford-based statistician studying aging dinosaurs. Perhaps Apple computer, the bootstrap and isotonic regression aren't as dead as some people want us to believe.

Peter Kop, ICLON, Universiteit Leiden

Statistics in secondary education, now and in the future (after 2014)

Peter Kop discusses how mathematics is currently taught in the Netherlands and which topics are covered. He also gives examples of examination questions, discusses the use of graphing calculators and computers, and discusses the new plans of the Commissie Toekomst Wiskundeonderwijs (cTWO).

Jan van der Laan, Centraal Bureau voor de Statistiek, Divisie Methodologie en Kwaliteit

Imputation of rounded data

In surveys, respondents regularly round their answers. In the Enquête Beroepsbevolking (Labour Force Survey), for example, people are asked about the duration of their unemployment. The answers given clearly show that people round to whole or half years. Another example is income, where people round to multiples of 100 or 1000. Rounding can bias statistics based on these data. To correct for rounding, a model has been derived that can describe the observed data. The model consists of two parts: first, a model for the true underlying distribution (which is not observed), and second, a model for the rounding mechanism. After the model has been estimated from the data, it is used for multiple imputation: on the basis of the model for the rounding mechanism, persons who have rounded are selected, and new values are imputed stochastically for these persons. Both the model and its application to the Enquête Beroepsbevolking will be presented.

Rik Lopuhaä

The limit distribution of the L∞-error of isotonic estimators

Let $f$ be a non-increasing function defined on $[0,1]$. Under standard regularity conditions, we derive the asymptotic distribution of the supremum distance between $f$ and its isotonic estimator on any interval $(\alpha_n, 1-\alpha_n] \subset [0,1]$, where $\alpha_n$ tends to zero at a suitable rate. The rate of convergence of the supremum distance is found to be of order $(\log n / n)^{1/3}$, and the limiting distribution turns out to be Gumbel with a parameter depending on a functional of $f$ and $f'$. The results are obtained in a general framework, which includes the Grenander estimator of a decreasing density, the least squares estimator of a monotone regression curve and an isotonic estimator of a decreasing hazard for right-censored observations. (Joint work with Fadoua Balabdaoui (Paris), Cécile Durot (Orsay) and Vladimir Kulikov (ASR).)

Jacob Jan Paulus, Consultant, CQM

The importance of modelling skills and implementations in OR education

At the University of Twente, the course Optimization Modeling has for years emphasised the implementation of models. Time and again, students recommend the course to one another. Year after year the lecture hall is full of enthusiastic students of Applied Mathematics, Industrial Engineering and Management, Computer Science and Mechanical Engineering. What makes this course so popular?

Theo Stijnen, Leids Universitair Medisch Centrum

The master's programme Statistical Science

No abstract received.

Martijn Tennekes, Piet Daas & Edwin de Jonge, Centraal Bureau voor de Statistiek, Divisie Methodologie en Kwaliteit

Visual profiling of large statistical datasets

National Statistical Institutes often have to deal with large datasets, such as administrative data and data collected by large surveys. Before these data sources enter the statistical process, they are usually first checked at a technical level, followed by a more detailed study of the data.

The quality assessment at the technical level starts with several technical checks, such as the readability and convertibility of the data file. The next step is to investigate the representations and distributions of the values, and to look for strange data patterns. This stage is often called data exploration or data profiling. At Statistics Netherlands, this inspection is usually restricted to a visual inspection of the first 100 records of the data in tabulated form. For this stage, it would be useful to obtain a first impression of the dataset as a whole in a single figure. A visualisation method that we think is well suited to this task is the tableplot.

A tableplot is a spreadsheet-like plot of the data. It can show a dozen variables, both numeric and categorical, at once. Each column represents a variable and each row represents an aggregation of a certain number of records. For each numeric column, a bar chart of the mean values is depicted. For categorical columns, a stacked bar chart is depicted according to the proportion of categories within the aggregation groups.
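The aggregation behind a tableplot can be sketched in a few lines. This is a minimal sketch using only the Python standard library, not the authors' implementation. It assumes records are dictionaries, that the sort column has no missing values, and that missing values elsewhere are stored as `None`. The function name `tableplot_bins` and the example columns are hypothetical.

```python
from collections import Counter

def tableplot_bins(records, sort_key, numeric, categorical, n_bins):
    """Sort records, split them into row groups, and summarise each group:
    means for numeric columns, category proportions for categorical ones."""
    ordered = sorted(records, key=lambda r: r[sort_key], reverse=True)
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        row = {"n": len(chunk)}
        for col in numeric:
            values = [r[col] for r in chunk if r[col] is not None]
            # bar length for this row group: mean of the non-missing values
            row[col] = sum(values) / len(values) if values else None
        for col in categorical:
            # stacked bar: share of each category, with missing as its own category
            counts = Counter("missing" if r[col] is None else r[col] for r in chunk)
            row[col] = {cat: c / len(chunk) for cat, c in counts.items()}
        bins.append(row)
    return bins

records = [
    {"income": 40, "sector": "a"},
    {"income": 10, "sector": "b"},
    {"income": 30, "sector": "a"},
    {"income": 20, "sector": None},
]
bins = tableplot_bins(records, "income", ["income"], ["sector"], n_bins=2)
```

Each element of `bins` corresponds to one row of the plot. Drawing the bars is left to a plotting library.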

By using a tableplot, analysts are able to observe the relationships between the variables, discover strange data patterns, and examine the occurrence and selectivity of missing data. Tableplots are preferably used interactively. Users are able to adjust the number of aggregation groups, sort the data by one or more columns, and zoom in on the data.

The strength of tableplots in comparison to other visualisation methods is that in one single plot, the value distributions of multiple variables are shown in relation to each other, with missing values taken into account. We will discuss the use of tableplots in data quality assessment, and show some of the results obtained when applying tableplots to a number of Dutch administrative sources and statistical survey datasets.

Robert Zwitser, Cito and Universiteit van Amsterdam

An international comparison of mathematics knowledge in secondary education using PISA

Robert Zwitser analyses data for the Programme for International Student Assessment (PISA). He makes an international comparison of mathematics knowledge in secondary education.

 18

19

20

DAG voor STATISTIEK en BESLISKUNDE. donderdag 3 maart Jaarbeurs (Beatrixgebouw), Utrecht. Jerome Friedman. hoofdsprekers

Recommend Documents