Hierarchical clustering of product transition strategies based on symbolic trend representation in a multi-product process
Technológiai-üzemeltetési stratégiák csoportosítása hisztorikus id sorok szimbolikus epizód reprezentációján alapulva
B. Balasko, S. Nemeth and J. Abonyi University of Pannonia, Institute of Chemical and Process Engineering P.O.Box 158., Veszprem, Hungary
Összefoglaló kivonat Az adatelemzést gyakran összekapcsolják statisztikai eszközök alkalmazásával, melynek oka azok könny alkalmazhatóságában és a nagy mennyiség adat gyors kezelésében keresend . Az id sorok min ségi elemzési módszerei következésképpen mindig kiegészülnek valamilyen adatredukciós technikával, mint például a f komponens elemzés a dimenziók vagy szegmentálás az adatok mennyiségének csökkentésére. Jelen cikkünkben mindkét technikát alkalmazzuk annak érdekében, hogy a nagy mennyiség , többdimenziós adatainkat olyan formába hozhassuk, amelyek már alkalmas min ségi vizsgálatok elvégzésére. Az irodalomban fellelhet számos szegmentálási technika közül a kevéssé elterjedt háromszög epizódokat alkalmaztuk az adatsorok grafikus reprezentálására, majd az így kapott epizód szekvenciákat a bio-informatikában már elterjedten használt szekvencia-illesztés segítségével hasonlítottuk össze. Ennek eredményeként a technika képes nem felügyelt módon összehasonlítani két id sort illetve több id sor esetén azokat hasonlóságuk alapján csoportosítani is. A háromszög-epizód szegmentálás lényege, hogy egy id sor értékeib l annak els és második deriváltját is felhasználva olyan tripleteket állít el , amelyben bármely érték el jelének megváltozása a trend jellegének megváltozására, azaz egy epizód végére utal. Másképpen fogalmazva, egy id sort annak zérus helyei, széls értékei és inflexiós pontjai alapján szegmentál epizódokra, a deriváltak el jelváltásai alapján pedig hét elemi háromszög epizódot definiál. Ezt a hét alaptípust {A,B,C,D,E,F,G} bet kóddal leírva kapható meg egy id sor epizód szekvenciája, valamint ha ezeket tovább osztjuk id tartam és megváltozás mértéke alapján {kis, közepes,nagy} értékekre, akkor 57-féle „szimbólum”-ot nyerünk egy trend reprezentálására. A szekvencia-illesztés egy dinamikus programozási mátrixon alapuló technika, melyet eredetileg DNSláncok összehasonlítására fejlesztettek ki, és egy olyan távolság mátrixon alapul, amelynek koordinátái az egyes fehérjék egymástól való transzformációs távolságát tartalmazzák. Azaz két DNS-lánc távolsága (optimális illesztés mellett) az ket alkotó gének transzformációs távolságának összegei közül a minimális. Ez alapján két karakterlánc távolsága is definiálható, ha ismert azok távolságmátrixa. Mi ennek reciprokával, a két szekvencia hasonlóságával számoltunk, ezáltal, két epizód szegmensekre bontott id sor hasonlóságát egyetlen számban kifejezve azok optimális illesztése értékelhet . Ezt a technikát a TVK NyRt. Polipropilén-IV üzemekben gy jtött adatokon teszteltük. Bemutattuk, hogy egy m köd , többtermékes technológia termékváltási adatait vizsgálva, az egyes termékváltások mennyire hasonlítanak egymáshoz, illetve milyen módon csoportosíthatóak a trend alakja alapján, ezáltal min ségükben kerültek összehasonlításra. Az eredményeket egy dendrogramban foglaltuk össze, amelyen jól látható, hogy az azonos termékr l azonos termékre váltások sem történtek egymáshoz jelent s mértékben hasonlóan, ezzel egy elméletileg optimális, de plauzábilis termékváltási profil megalkotása a következ lépés.
Introduction There are several ways in the data mining literature to analyze the large amount of time series. Quantitative methods are widely spread because of their statistical nature, it is always easy to calculate the average, the covariance or the quantiles from a data set but it always claims prior knowledge to analyze the results, i.e. to use these basic statistics of a time series for qualitative analysis. A common method for decreasing the size of a data set and to get qualitative instead of quantitative information is time series segmentation. Segmentation means finding time intervals where a trajectory of a state variable is homogeneous [1]. Segments can be linear, steady-state or transient, indicative for normal, transient or abnormal operation. Cheung and Stephanopoulos in [2] proposed a second order segmentation method for process trend analysis, the application of episodes with a geometrical representation of triangles. Triangular episodes use the first and second derivatives of a trend on a geometrical basis, hence seven primitive episode can be achieved as characters. Lot of researchers in the literature applied modified episode segmentation, Wong et. al [3] extended the seven-symbol-set to 57 episodes by fuzzifying each episode by duration and magnitude as {small, medium, large}. To extract useful feature from time series of the state variables one needs to lower the size and dimension of the data and define a distance measure from a theoretically optimal solution to help operators in their work. This article proposes principal component analysis (PCA) aided triangular episode segmentation to represent multivariate product transitions as a 1-D symbolic trend. For sequence comparement, in [4] it was shown as an example that dynamic time warping (DTW) is able to compare DNA sequences if mutation weights (as distances) exist. Going towards this dynamic alignment technique, we applied global pairwise sequence alignment, a
well-known technique in bioinformatics developed by [5], to handle not only mutation and substitution but injection and deletion operators in a sequence. We applied the proposed technique to compare, qualify and cluster the episode sequences of 1-D product transitions based on process data achived from an existing multiproduct polymer plant. The following sections deal with a shortened overview of the background of our methodology, that was described in our former work [6]. The case study is based on data collected from TVK Plc., Hungary. Principal Component Analysis Principal Component Analysis or hotelling algorithm is a widely known and applied method for lowering the dimensionality of a data set based on its multidimensional structure and find patterns in data [7]. During an orthogonal linear transformation from n to a lower dimension of q, PCA calculates the eigenvectors and eigenvalues of the n-dimensional preprocessed (zero prospective value) data and selects the largest q eigenvalues, which corresponding eigenvectors create a subspace, where the original data is projected into. In other words, it finds the most significant directions with the largest variance in the data set. As a formal description, let X be an n×N dimensional data set, where N means the number of observations. The aim of PCA is to find a P orthonormal n×n projection matrix that fulfills the following equation: P-1 = PT
(1)
Y = P TX
(2)
where Y is the diagonal covariance matrix of the transformed data. In general, the principal components of a data set are calculated from the correlation matrix C by the following eigenvalue equation:
Cp = p
(3)
The eigenvalues are decreasingly sorted, the eigenvector corresponding to the largest eigenvalue will be the first principal component and so on. The selected principal components are collected in P and the projected data are calculated by Equation 1. If not all the principal components are selected, the projection error of PCA (cumulative percentage of the selected eigenvalues) can be given as follows: q
i =1
λi
(4)
Episode segmentation As described in [2], to get from a quantitative to a qualitative representation of a real-valued x(t) function, it has to be reasonable function. It is clear that all the psychical variables in a plant operation are reasonable. It is considered, if we know the value and the derivatives of a reasonable function, the state of that function is completely known. The continuous state (CS) over a closed time interval can be defined as a point value, which is a triplet (if x(t) is continuous in t): CS(x,t)
PtVl(x,t) =
x(t ), x' (t ), x' ' (t ) (5)
Consequently, a continuous trend can be defined as continuous sequence of states. For discrete functions, as an approximation, an underlying continuous function has to be known since the derivatives of single points cannot be performed. These definitions lead to a qualitative description of a state (QS) and trend if x is continuous at t, otherwise it is undefined: QS(x,t) =
[x(t )], [x' (t )], [x' ' (t )]
(6)
where [x(t)],[x’(t)] and [x’’(t)] can be {-; 0; +}, depending if they have negative, zero or positive values. Obviously, a qualitative trend of a reasonable variable is given by the continuous sequence of qualitative states. QS(x; t) is called an episode if it is constant for a maximal time interval (the aggregation of time intervals with same QS), and the final
definition of a trend of a reasonable function is a sequence of these maximal episodes. An ordered sequence of triangular episodes is the geometric language to describe trends. It is composed of seven primitives noted as {A,B,C,D,E,F,G} illustrated in Fig. 1. Wong et al. fuzzified these seven primitives into a fuzzy set of 57 episodes with a fuzzy membership function [3]. Every episode is assigned to be {small(s);medium(m);large(l)} by duration and magnitude with the highest membership value. Actually, this means that they hard partitioned the episodes by values of the intersections of the membership function.
Figure 1: Seven primitive episodes proposed by Cheung and Stephanopoulos Pairwise Sequence Alignment Sequence alignment is typical expression of bioinformatics, where amino acid or nucleotide sequences have to be compared, how far the evolved new sequences are from the elders, i.e. how old they are, and how many mutation steps were needed to result in the new sequence. Applying the minimal evolution, one tries to find the least mutation steps between the elder and offspring sequence. Naive algorithms compares all possible alignments and select one with minimal sum of transformation weights. Fast algorithms calculate in an other way: Let An be a n-element sequence and Bm a m-element sequence, an and bm their nth and mth element; a α*(An;Bm) denotes the set of optimal pairwise alignments of An and Bm and w(α*(An;Bm)) the
sum of transformation weights for these optimal alignment. The basic idea in fast algorithms is that if we know w(α*(An¡1;Bm)), w(α*(An;Bm-1)) and w(α*(An-1;Bm-1)), then w(α*(An;Bm)) can be calculated within a constant time period. If we leave the last aligned pair in an optimal alignment of An and Bm then we get an optimal alignment of (An-1;Bm), (An;Bm-1) or (An-1;Bm-1), depending on that last mutation step was a deletion, injection or substitution, respectively: w(α * ( An−1 , Bm )) + w(an → −) w(α * ( An , Bm )) = min w(α * ( An , Bm−1 )) + w(− → bm ) w(α * ( An−1 , Bm−1 )) + w(an → bm ) (7) The optimal alignment weights are given in a dynamic programming matrix D with a size of (n+1)×(m+1). The initial conditions for the 0th row and column (d0,0 = 0): di,0 =
i l =1
w(al → −) ; d0,j =
j k =1
w(− → bk ) ; (8)
The above three equations are the main difference between DTW and pairwise alignment, to fill up D matrix, Eq. 7 is used. Optimal alignments are started at dn,m and ended in d0,0, while in every step the minimal weight is chosen and stepping left means an injection, stepping upwards means a deletion and stepping diagonally upwards means a substitution. This method was developed by Needleman and Wunsch[5]. We applied a scoring matrix instead of transforming weights to align our episode sequences, the only difference is that it maximizes the score of alignment in every step, and the optimal alignment will be the path with the maximal score. The proposed algorithm The implemented algorithm has a detailed description in [6], we hereby present only a short summary. The algorithm has the following basic steps:
- Preprocessing of data (PCA and/or Gaussian filtering); - Time series segmentation into a sequence of triangular episode primitives; - Partitioning of episodes by duration and magnitude; - Alignment of two episode chains; As mentioned previously, we defined a scoring matrix instead of using evolutionary transformation weights in order to get a maximized score for each alignment. Based on these alignment scores, one is able to compare and classify process trends to get a qualitative analysis. Case study This section deals with product changing strategies of a Hungarian polymer plant. The plant produces propylene polymer in two loop reactors and propylene-ethylene copolymer is produced in a gas phase reactor after the loop reactors (Himont technology). Product quality is indicated by melt flow index (MFI) that higly depends on hydrogene contentration. Production is performed in product cycles to minimize offgrade product: 6 homopolymers are produced with rising MFI (H1-H6), then 11 copolymers with decreasing MFI (C1-C11) in a cycle. This strategy suggests that all H-H and C-C transitions has to be similar by shape, H-C and C-H transitions may form different groups. Product transitions are performed manually by tuning of five different process values done by operators: (i) hydrogene inlet concentration in 1st and (ii) 2nd loop reactors, (iii) reactor temperature in 1st and (iv) 2nd reactors, (v) catalyst inlet flow into the 1st reactor. While episode segmentation can only handle 1-D trends, every 5-D product transition is projected with the same principal component into a 1-D data series as a function of time with an eigenvalue percentage of 98.8% (1.2% information lost). A number of 31 transitions were compared based on the score of alignment of 2-hour-trends.
A dendrogram of the clustering results can be seen in Fig. 2, where distances were defined as (score)-1. As one can see in the middle section of the figure, only a part of copolymer transitions could be grouped together as transition trajectories with similar shape. This type of analysis showed that not even the same type of transitions were performed similarly during product changes (see Fig. 3), hence a theoretically optimal – feasible strategy could result in a more efficient production. C9C10 H1H2 C8C9 C5C6 C11H1 H6C1 H5H6 C4C5 H3H4 C7C8 C6C7 C3C4 H1H2 C2C3 C1C2 C10C11 C5C6 C1C2 C7C8 C10C11 C8C10 H2H3 H5H6 C2C3 C11H1 C6C7 H4H5 C3C5 H2H3 H6C1 0.12
0.1
0.08
0.06
0.04
0.02
Figure 2: Hierarchical clustering of production smC.smD.smA.smB.smC.smD.smA.smD.smA.smD.smA.ssD.slA.smB.smC.slB.smC.slB.smC.smD.slA..mG. 0,005 0 -0,005 smD. smB. ssD.slA. smB. smC. smA. smC. smD. smA.smD.smA. smD. smA. smC. slB. smC. slB.
-0,01
0
50
100
150
200
250
300
350
smC. smD. slA. .mG.
400
450
smC.smB.slB.slC.slD.smD.slA.slB.smC.ssB.smB.slC.smD.smA.smB.slC..sG.
0.006
0
smB.slB. smC.
-0,006
0
50
slC.
slD.
100
smD. slA.
150
200
slB.
250
ssB. smA.smB. slC. .sG. smC. smB. slC. smD.
300
350
400
450
transitions Figure 3: Comparement of two C1C2 transitions CDABCDADADADABCBCBCDA--G | :|| | | |||| |||| | C-BBC--D-D--ABCB-BCDABCG Figure 4: Primitive episode alignment of the trends on Fig.3. with an identity percentage of 58 %
Conclusions The proposed algorithm can extract useful information from quantitative time series. For decreasing the large amount of data and for resulting in a qualitative description of trends, it applies PCA aided triangular episode segmentation. It has been shown that this tool is able to find similarities in two trends and compare them based on pairwise sequence alignment. Towards this goal, one was able to search for similarities in trends, hence classify process trends based on their shape encoded into episode sequence. The authors would like to acknowledge the support of the Cooperative Research Centre (VIKKK) (project 2004-I) and Hungarian Research Found (OTKA T049534). References [1] Keogh, E., Chu, S., Hart, D. and Pazzani, M.: An Online Algorithm for Segmenting Time Series, IEEE International Conference on Data Mining (2001) [2] Cheung, J. T., Stephanopoulos, G.: Representation of process trends. Part I. A formal representation framework, Computers and Chemical Engineering, 14, 495-510 (1990) [3] Wong, J.C., McDonald, K.A. and Palazoglu, A.: Classification of process trends based on fuzzified symbolic representation and hidden Markov models, J. Proc. Cont., 8(5-6), 395-408 (1998) [4] Srinivasan, R., Qian, M.S.: Online fault diagnosis and state identification during process transitions using dynamic locus analysis, Chemical Engineering Science, 61, 6109-6132 (2006) [5] Needleman, S.B, Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443-453 (1970) [6] B. Balasko, S. Nemeth and J. Abonyi, Qualitative Analysis of Segmeted Time Series by Sequence Alignment, 7th International Conference of Hungarian Researchers on Computational Intelligence, Budapest (2006) [7] Smith, L.I., A tutorial on Principal Component Analysis (2002)