IN3405 Bachelorproject - Final Report chainels - Network Visualisation
Hans van den Bogert (1307983) André Vollering (1516485)
Christiaan Titos Bolivar (4009010)
Technical University Delft, Faculty Electrical Engineering, Mathematics and Computer Science Date: 12-07-2012 Company: chainels TU Delft Supervisor: G. Gousios chainels Supervisors: E. Buckers, V. Koeman & Dr.Ir. S.I. Suddle Coordinator BSc Project: H.G. Gross
i
Acknowledgements The last few months we worked on our Bachelor project for Computer Science at the TU Delft. The main objective of this project is letting students experience how it is to work and experience the whole progress of software development. We did our project at chainels a startup company, that is developing a new business to business social media platform. We worked full time on this project for 12 weeks, starting in late April and ending mid Juli 2012. This report is the result of our work. We want to thank everyone on the chainels team for their support; especially Erwin Buckers and Vincent Koeman, the two main developers who guided us through the process. Also we want to thank our supervisor, Georgios Gousios, for his comments and thoughts on our progress.
ii
Nomenclature BCR
Business Chain Recommendation
chain
A connection between two companies
FCR
Friend Chain Recommendation
RPC
Remote Procedure Call – Invoking a procedure i.e. function, on a remote machine and capture its output to be handled on the local machine.
SimpleTest
A framework for php unit tests and web tests
SME
Small and medium enterprises
iii
Contents 1 2
3 4
5
6
7
8 9
10 11 A B C D E F
Introduction . . . . . . . . . . . . . . . . . Problem Analysis . . . . . . . . . . . . . . 2.1 Current Situation . . . . . . . . . . . . 2.2 Problem Description . . . . . . . . . . 2.3 Planning . . . . . . . . . . . . . . . . . Requirements . . . . . . . . . . . . . . . . 3.1 Functional Requirements . . . . . . . . 3.2 Nonfunctional Requirements . . . . . . Design . . . . . . . . . . . . . . . . . . . . 4.1 Visualization . . . . . . . . . . . . . . 4.2 Recommendation . . . . . . . . . . . . 4.3 Class Diagrams . . . . . . . . . . . . . Implementation . . . . . . . . . . . . . . . 5.1 Visualization . . . . . . . . . . . . . . 5.2 Recommendation . . . . . . . . . . . 5.3 SIG Feedback . . . . . . . . . . . . . . Testing . . . . . . . . . . . . . . . . . . . 6.1 Unit Tests . . . . . . . . . . . . . . . . 6.2 Integration Tests . . . . . . . . . . . . 6.3 Acceptance/User Tests . . . . . . . . . Recommendation Performance Evaluation 7.1 Creating a Test Set . . . . . . . . . . . 7.2 Testing our Algorithms . . . . . . . . . 7.3 Accuracy Results . . . . . . . . . . . . 7.4 Running Time Results . . . . . . . . . 7.5 Conclusions . . . . . . . . . . . . . . . Requirements Evaluation . . . . . . . . . 8.1 Checklist . . . . . . . . . . . . . . . . Recommendations for Future Work . . . . 9.1 System Extensions . . . . . . . . . . . 9.2 Optimization . . . . . . . . . . . . . . 9.3 Recommendations for CHAINels . . . Conclusion . . . . . . . . . . . . . . . . . Project Log . . . . . . . . . . . . . . . . . 11.1 Process . . . . . . . . . . . . . . . . . 11.2 Logs . . . . . . . . . . . . . . . . . . . Screenshots . . . . . . . . . . . . . . . . . Problem Description . . . . . . . . . . . . Plan of Approach . . . . . . . . . . . . . . Orientation Report . . . . . . . . . . . . . Requirements Analysis & Design . . . . . SiG Feedback . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1 2 2 2 3 4 4 5 6 6 7 10 13 13 15 16 19 19 20 20 21 21 21 22 23 24 25 25 27 27 27 28 30 31 31 31 36 39 42 47 64 83
iv
Summary chainels is a new social network site that focuses on business-to-business relations. chainels found that many companies would be interested in learning more about their business network. A visualization of a company network can provide the user with a clear and simple overview of its network. Our project consisted of creating this network visualization. The visualization should also show the recommended chains of a company. Because the current recommendation algorithm does not provide optimal recommendations (it doesn’t take any business logic into account), we also have to design a new recommendation algorithm. We defined three types of visualization for a company: a geographical visualization, a business chain visualization, and statistics. The geographical visualization consists of a map with markers for each company, and lines between those companies representing the chains. The user can filter the companies on industry, chain type and chain creation time. The business chain visualization shows a tree with the companies grouped on either chain type or industry. The statistics show a line graph which shows chains over time, and two pie-charts showing the distribution of chain types and industries. For the implementation of the visualization we used the d3.js javascript library[4]. D3 creates svg elements based on data, and for each element a behavior can be defined using D3. This results in an interactive application. We split the recommendation algorithm into two parts: friend chain recommendations (fcr) and business chain recommendations (bcr). The reason behind this is because these are two very different types of relationships, that don’t necessarily have anything to do with each other. For the fcr algorithm we use the Adamic/Adar scoring function to compare two companies. For the bcr algorithm we defined our own score function, based on supply and demand, industry, and distance. Testing our recommendation algorithm we found that it performs considerately better than the current recommendation algorithm. The accuracy of the recommendations is up to 70% higher than the current algorithm.
1 Introduction
1
1
Introduction
chainels is a business-to-business social media platform. Where other social networks such as Facebook[5] and LinkedIn[12] focus on personal relationships, chainels focuses on companies as an entity. On chainels, companies can make chains with other companies, define supply and demand, write testimonials and more. The website is currently still in development. The chainels team is actively approaching companies (primarily focusing on sme’s) to inform them about their new social media platform. Through these talks they found that many companies like to see a visual representation of their business network. chainels themselves agreed with this and the need for a network visualization emerged. chainels also wanted to show chain recommendations in the visualization, which was deemed a logical consequence due to the social nature of the chainels website and a means to further improve the connectedness of the network. Therefore a company recommendation algorithm had to be made, which already existed in some form, but was in need of a complete makeover. Our project consisted of making a visualization of a company network. We defined three main visualizations, each providing the user with a useful, interactive overview of its network. We also designed a new chain recommendation algorithm, aimed at providing relevant business recommendations to the user, as well as providing friend recommendations. The structure of this report is as follows: Section 2 consists of a more detailed description of our assignment and why improvement was needed. In Section 4 we discuss the design of our system and in Section 5 we discuss how we implemented it. In Section 6 we explain how we tested our algorithms and in Section 7 we discuss the results of those tests. And finally, Section 9 consists of a number of recommendations for future work.
2 Problem Analysis
2
2
Problem Analysis
2.1 2.1.1
Current Situation Team
chainels started out as an idea of dr.ir. S.I. Suddle, and the site was developed as a Bachelorproject by Erwin Buckers and Vincent Koeman. Their project took place in the summer of 2011, and they have continued working on chainels since then. The persons we had contact with and/or worked with on this project are: dr.ir. S.I. Suddle, the CEO of chainels and the company contact for the TU Delft, Erwin Buckers and Vincent Koeman, the two developpers. We met with the two developpers regularly (weekly or more) to discuss our progress, decisions and questions. We also worked with Willem Buijs, the designer, to discuss the look and feel of our application. 2.1.2
System
Back-end chainels is developed in object-oriented php. The relevant classes for us are the Company and Chains classes. A Company has a number of attributes: an id, address, industry and supply & demand labels. A Chain between two companies has the attributes type (Friend, Supplier, Customer or Other), time of creation and a private flag (a private chain is invisible to all other companies). The current recommendation algorithm simply checks for chains of chains and then counts the number of occurrences of a company, returning a list of the companies with the most occurrences. This doesn’t give optimal recommendations because it doesn’t take into account any business logic. The above classes have their corresponding data saved in a database called Redis[17]. This nosql[21] database keeps all data in-memory. The data is retrieved through keys which have corresponding values. Due to low overhead and the ommittance of relatively slow persistent storage like harddisks and the near-linear scaling of Redis, it provides a future proof solution for data storage. Front-end chainels uses a BigPipe[3] system to load the webpages. With BigPipe a webpage consists of pagelets, small parts of the complete page (for example, the menu bar), these pagelets are then loaded one by one, in a certain order. The visualization and the recommendations will be located in the Chain pagelet (2). The html isn’t hand coded, it is generated with php. The idea for the network visualization was to generate a simple graph in php using graphviz[6]. The development team at chainels made a small start with this, before assigning this functionality to a bachelorproject.
2.2
Problem Description
As mentioned in the introduction, through their talks with companies, chainels found that many of those companies would be interested in seeing a visual representation of their business network. Especially small businesses like to have this information available to them, so they can show it to others, or analyze their own network. chainels doesn’t have such a visualization, but they want to have such a visualization. The new recommendation, which is needed because the current algorithm produces sub-par recommendations, should take into account business information, such as supply/demand, to be able to provide useful new business contacts or opportunities for companies. No trivial complete solutions exist and as such an algorithm must be invented for the specific case of chainels, albeit using preexisting sub-solutions.
2 Problem Analysis
2.3
3
Planning
We divided the project into 4 phases. An orientation phase, a design phase, an implementation phase and the final phase (working on report and presentation). In Figure 1 the planning is visualized in a Gantt chart. In the orientation phase we researched current visualization techniques, existing libraries we could use, and how recommendation algorithms work. This phase was concluded with the orientation report in Appendix D. The design phase consisted of defining the requirements, use-cases, and designing our recommendation algorithm (Appendix E). We then planned a 5 week implementation phase, using scrum as a development method.
Ϯ ϭ Ϭ Ϯ Ͳϰ Ͳϯ Ϯ
Ϯ ϭ Ϭ Ϯ Ͳϰ ͲϬ ϯ
Ϯ ϭ Ϭ Ϯ Ͳϱ Ͳϳ
Ϯϭ ϬϮ Ͳϱ Ͳϰ ϭ
Ϯϭ ϬϮ Ͳϱ Ͳϭ Ϯ
Ϯϭ ϬϮ Ͳϱ Ͳϴ Ϯ
Ϯϭ ϬϮ Ͳϲ Ͳϰ
Ϯϭ ϬϮ Ͳϲ Ͳϭ ϭ
Ϯϭ ϬϮ Ͳϲ Ͳϴ ϭ
&ĂƐĞ/Ͳ KƌŝĞŶƚĂƚŝŽŶ &ĂƐĞ//Ͳ ĞƐŝŐŶ &ĂƐĞ///Ͳ /ŵƉůĞŵĞŶƚĂƚŝŽŶ &ĂƐĞ/sͲ ZĞƉŽƌƚ
Fig. 1: Project planning
Ϯϭ ϬϮ Ͳϲ Ͳϱ Ϯ
Ϯϭ ϬϮ Ͳϳ ͲϮ
Ϯϭ ϬϮ Ͳϳ Ͳϵ
Ϯϭ ϬϮ Ͳϳ Ͳϲ ϭ
Ϯϭ ϬϮ Ͳϳ Ͳϯ Ϯ
Ϯϭ ϬϮ Ͳϳ ͲϬ ϯ
3 Requirements
3
4
Requirements
This section contains an overview of the requirements made in the Requirements Analysis Document (RAD). The full RAD document can be found in Appendix E.
3.1 3.1.1
Functional Requirements General
1. The company must be able to see a simple geographical graph. 2. The visualization can be toggled to full screen. 3. The user can switch between different kinds of visualizations/graphs. (if there are multiple). 4. The user can click on a company in the graph (represented as a node) and then see the business card of that company (with a link to its page). 5. On the geographical visualization the different kind of industries and the different chain types must be distinguished by color. 6. When there is an older (
Internal (analysis)
7. There are three different “main” visualizations (a) Geographical (b) Company chain (c) Statistics 8. For each main visualization, the user can choose a different view. 9. There must be a simple geographical graph. 10. Each graph has a timeline with a slider. Sliding to a different date will update the visualization to show only the chains up to that date. 11. For each graph there must be a zoom functionality. 12. The user can filter different kinds of chain types. 13. The user can see his recommended chains in the geographical graph. 14. In the statistics visualization there is a view of the distribution of chains by industries. 15. In the statistics visualization there is a view of a graph which contains the number of chains over time. 3.1.3
External
16. A visitor and a user on a company page see a geographical graph of the company network. 17. The nodes that are chained to both companies are marked.
3 Requirements
3.1.4
5
Recommended Chains
18. The recommendation algorithm must take into account the industry, supply and demand, location, chain type and possibly competition. 19. Private chains may not be used in the algorithm, to ensure privacy. 20. The recommendation algorithm must give relevant recommendations that looks further than only recommending “friends of friends”, which is the case in the current implementation.
3.2
Nonfunctional Requirements
21. The user-interface must fit in the current design of chainels. 22. The system must be able to process large amount of users (in the order of tens of thousands). 23. The graphs must be able to display large amount of companies and chains in a convenient and clear way. 24. The network visualization application must be compatible with the modern Internet browsers (IE9+, Firefox, Chrome, etc).
4 Design
4
6
Design
4.1 4.1.1
Visualization UI Design
The pagelet on which the visualization will be situated, consists of three parts: a box that shows the current chains of a company, a list of chain recommendations, and a placeholder for the visualization. This means that our visualization already had a predetermined size and location. It has to be located on the bottom of the pagelet, and it should be no more than 625 pixels wide. In our requirements document we defined the three visualizations: geographical, business chain and statistics. The business chain visualization can be grouped on chain type and on industry. The statistics consists of a line graph showing the number of created chains over time, and two pie charts that show the distribution of chain types and industries. The statistics are viewed more as a "proof of concept", that can be extended later on by chainels.
Fig. 2: Mockup of the User interface
4 Design
7
Figure 2 shows a mockup of the user interface design of the geographical visualization. The essential elements are the three buttons to switch the visualizations, the companies as nodes, the chains as edges, the filters and the timeline. All these elements are derived directly from the requirements. Appendix E contains the mockups of the other visualization types. Screenshots of the final UI design are located in Appendix A. 4.1.2
Retrieving data from the back-end
To display the different types of visualizations, data will need to be dynamically loaded into the eventual webpage to keep the application running smooth and fast. Dynamically means that no complete page will have to be loaded from the back-end server. A low-level method of doing so has been around in the browser environment for quite some time now known as Ajax[20]. Ajax is a mechanism for already loaded pages to retrieve data (and display if necessary) from a remote server on a need to have basis. So if a user chooses a different visualization, only the visualization data will need to be transferred. A more conventional way, but less efficient, would’ve been to refresh the complete page with the only difference that the type of visualization had been changed. It’s not hard to see that transferring page layout, pictures and other resource is subpar when considering responsiveness. To have a consistent manner of approaching the back-end through Ajax, a rpc-over-http api has been created, easing the burden of duplicate code between front-end and back-end by making back-end functions available through http requests. More in-depth details can be found in Section 5.1.2.
4.2
Recommendation
Designing the recommendation algorithm, we quickly found that companies on chainels have different types of relationships. The most prominent type of relationship is a business relationship, such as supplier-customer, but a company can also have "friend" relationship. For example stores that are located in the same shopping mall, or two companies in the same building. These companies would also make chains with each other, but do not do any business. To make a business recommendation, a company’s "friends" are irrelevant, and to make a friend recommendation the business relationships are irrelevant. We therefore need to make two different recommendations. To this end we designed two algorithms, a Friend Chain Recommendation algorithm (fcr) and a Business Chain Recommendation algorithm (bcr) 4.2.1
Friend Chain Recommendation
Every social network uses some kind of friend recommendation algorithm. In our orientation report (Appendix D) we researched the field of link-prediction algorithms. Link prediction algorithms compute/predict links (edges) in a graph based on observed links. A score function gives a score to a combination of two nodes (i.e.: probability of those two nodes having a link). We concluded that the Adamic/Adar[1] scoring method would be best suited since this function performed best compared to other scoring functions[10]. The Adamic/Adar score function is as follows: score(x, y) =
X
z∈Γ(x)∩Γ(y)
1 log |Γ(z)|
(1)
With Γ(x) = the neighbours of node x. The way this works is that the rarer features are weighted more heavily. Adamic/Adar who introduced this measure used it to compute how related two home-pages were. In the case of a social network recommendation these features can be seen as neighbours, so it prefers ’rare’ neighbours over common ones. Neighbours are defined as ’rarer’ if they don’t have many friends. To sum this up, the algorithm returns a high score if there are alot of common neighbours that don’t have many friends and returns a low score when there aren’t any common neighbours or there are a few common neighbours that have alot of friends.
4 Design
8
The idea behind this algorithm is that friends share common friends. People with not many friends have most of the time a better relation with their friends, this is why ’rarer’ friends are weighted more heavier. To actually calculate the recommendations for a company x, we use Breadth First Search (bfs) to walk through the graph. bfs can skip the first found layer, since those companies are already friends, and can stop after the second layer, since companies in more distant layers share no common neighbours with x, and as a result the score will always be zero. Algorithm 1 Friend Chain Recommendation (fcr) algorithm 1 function FCR( x , minDepth , maxDepth ) 2 Queue q ; 3 scores ; 4 5 q . enqueue ( x ) ; 6 mark ( x ) ; 7 8 while ! q . isEmpty ( ) 9 y = q . dequeue ( ) ; 10 i f BFSdepth <= maxDepth 11 i f y != x && BFSdepth >= minDepth 12 s c o r e s [ y ] = adamicAdar ( x , y ) ; 13 end i f 14 for a l l c h i l d r e n c in y . getChildren ( ) 15 i f ! c . isMarked ( ) 16 c . mark ( ) ; 17 q . enqueue ( c ) ; 18 end i f 19 end f o r 20 end i f 21 end while 22 return s c o r e s ; 23 end Cold-Start Problem The above algorithm only gives useful scores if there are any preexisting chains, which is not always the case. For example a newly registered user starts without any chains. We’ve decided that the recommendations will then be based upon a trivial algorithm that selects the five most geographically nearby companies. The user most likely knows these companies and are therefore adequate friend recommendations. 4.2.2
Business Chain Recommendation
In our orientation phase we found that most recommendation algorithms work purely topologically based (like Adamic/Adar), or use a content based approach such as defining feature vectors of users that can be compared. A topological approach would not be optimal for business recommendations, since it would ignore relevant business data and would also take friend chains into account. A content based approach would work better in this case. Content based approaches look at features of nodes (companies in our case) and use a score function to compare these features and decide if the two nodes are a good match. We therefore need to define a score function, to compare two companies. Which factors (features) to take into account were determined and discussed first. The most obvious factor would be supply and demand. If company x needs computers and the algorithm
4 Design
9
recommends a company y that supplies computers, it’s a good recommendation. When two companies are compared we use the following function to determine the supply/demand score: sd(x, y) = |supply(x) ∩ demand(y)| + |demand(x) ∩ supply(y)|
(2)
The second factor is the industry of a company. While this isn’t as important as supply and demand, if a trend is seen in the network that two industries α and β frequently work together, it would still be a good recommendation to a company of industry α to be introduced to another company of industry β. This can also strengthen a recommendation based purely on supply and demand. To discover a trend in the network of two industries working together ("matching") we look at the field of association rule learning. Association rules in general work in a way if for example a customer buys milk and bread, and many customers who buy milk and bread also buy eggs, then eggs are recommended. The association rule in this case would be {milk, bread} ⇒ {eggs}. Pruning association rules is then done via filtering on the "interestingness" of a rule. This is done by calculating the measure of significance. There are numerous ways of calculating this [13]. We selected the four most often used methods: Support Support is the most basic method. The score returned is the amount of occurrences proportionate to the data set. If the support of an item is 0.3, it means the items occurs in 30% of all rules in the dataset. It can be used on single items or association rules themselves. Support(X ⇒ Y ) = size ∗ P (X, Y ) and Support(X) = size ∗ P (X) Where X ⇒ Y is the association rule between items X and Y , and P (X, Y ) is the chance of X and Y occurring together in a rule. Confidence Confidence is one of the more common methods. This function is calculates a relative score. ) Conf idence(X ⇒ Y ) = P (Y |X) = Support(X⇒Y Support(X) Lift This function is one of the less common methods. Lift is based upon the idea to see how strong X influences Y. It works almost the same as the confidence rule but instead of dividing by the support of the first item it divides by the support of the first item multiplied by the support of the second item. Support(X⇒Y ) Lif t(X ⇒ Y ) = Support(X)∗Support(Y ) Conviction Conviction is also one of the less common methods and is based upon the idea that X and Y¯ shouldn’t occur together. size−Support(X) Conviction(X ⇒ Y ) = size∗(1−Conf idence(X⇒Y )) To translate the general concept of association rules to our problem of "matching" two industries, we can view a chain between two companies (and thus two industries) as an association rule. If company x of industry α has a chain with company y of industry β, the association rule becomes α ⇒ β. This simplifies the whole process of association rule mining, since we only have one-to-one rules (a company can only belong to one industry). Those rules are directly useable because they are the chains themselves. Comparing two industries we then can simply look up the "interestingness" of that potential combination and use that result as our score between two industries. Therefore our score function to compare two industries is: iScore(α, β) = interesting(α, β)
(3)
4 Design
10
Where interesting(α, β) is one of the calculation methods support, confidence, lift or conviction. Which method is chosen in the final implementation should be decided after testing. The complete score function of comparing two companies x and y then becomes: score(x, y) = gi iScore(x.industry, y.industry) + gsd sd(x, y)
(4)
where gi + gsd = 1 When two companies have the same score, we then use a third factor, distance, to sort those companies. The final bcr algorithm uses the same breadth-first-search algorithm as fcr in Algorithm 1. For the maximum depth we choose 6, because of the "six degrees of separation" principle[22].
4.3
Class Diagrams
Figures 3 and 4 show the (truncated) class diagrams. For the recommendation design, we decided to create an interface class Recommender. This provides an easy way of changing the functionality of the Recommendation algorithm if needed in the future (for example, use an alternative to Breadth First Search), without the need of complicated refactoring.
4 Design
11
pkg Recommendation
<
> Recommender + getRecommendedFriendChains(company : Company) : array + getRecommendedBusinessChains(company : Company) : array + friendScore(cid1 : String, cid2 : String) : float + businessScore(cid1 : String, cid2 : String) : float
BFSRecommender - industryweight : float - sdweight : float + breadthFirstSearchWithScore(root : string, childGetter : function, score : function, minDepth : int, maxDepth : int) : array + friendScore(cid1 : String, cid2 : String) : float + adamicAdarScore(array1 : array, array2 : array, frequency : function) : float + businessScore(cid1 : String, cid2 : String, iweight : float, sdweight : float) : float - sortOnDistance(scores : array, root : Company) : array - sortScore(same : array, root : Company) : array + distance(cid1 : String, cid2 : String) : float
AssociationDiagram
SupplyDemand
+ fillDiagram(method : String) : void + fillWithSupport(links : array) : void + fillWithConfidence(links : array) : void + fillWithConviction(links : array) : void + fillWithLift(links : array) : void + getIndustryScore(cid1 : String, cid2 : String) : float
+ supplyDemandScore(cid1 : String, cid2 : String) : float
AssociationRules + getAllLinks() : array + createLink(industrykey1 : String, industrykey2 : String, cid1 : String, cid2 : String) : array + calculateSupportRule(industrykey1 : String, industrykey2 : String, links : array) : float + calculateConfidenceRule() : float + calculateConvictionRule() : float + calculateLiftRule() : float
Fig. 3: Class Diagram for the Recommendation component
4 Design
12
pkg Visualization Ajax_Controller_AbstractController + class : String + method : String + arguments : array + __construct(class : String, method : String, arguments : array) : void + invoke() : void
Ajax_Controller_AbstractLoggedInController + cid : String + internal : boolean + invoke() : void + createAjaxClass() : void
Ajax_Controller_GraphData + createAjaxClass() : void
GraphData - cid : String - internal : boolean + getGraphData() : void
Ajax_GraphData + plainclass : GraphData + __construct(cid : String, internal : boolean) : void + getGraphData() : void
Fig. 4: Class diagram for a part of the RPC framework (not every method of GraphData is shown)
5 Implementation
5
13
Implementation
5.1
Visualization
5.1.1
Making the visualization
In the orientation phase we decided on using the d3.js (Data Driven Documents) javascript library[4] for our visualizations. D3 creates svg (Scalable Vector Graphics) elements based on data. svg is an xml based file format, used for creating static or dynamic vector images. Each svg element can be controlled exactly the same way as any other html element, with behaviour based on events such as mouse clicks, mouse overs or scrolling. It also means that each element can be styled using css. D3 lets you easily define these behaviours and styles for each element (in a similar manner as jQuery[8]). The data that D3 uses to create svg elements use is in a json-format. Making D3 work is straightforward, simply include the Javascript files in the html page, and make sure there is a div element on the page that will contain the svg element generated by D3. The business chain and statistics visualizations are pure D3. For the geographical visualization we needed an extra library since it required a map that allows for zooming and panning. Polymaps[16] is such a library, it is very similar to D3 as it is written by the same author. Polymaps loads map imagery from a source. We used bing Maps[2] imagery since chainels already used bing Maps. Polymaps is capable of plotting points on the map based on coordinates, so in fact D3 isn’t even needed for this. However our visualization also required lines (arcs) between points, and that is not possible in polymaps alone. This is why D3 is also needed for the geographical visualization. Creating the visualization starts in the file visualization.js, seen in Listing 1. InitVis() is called when the html body is loaded. Then we use Modernizr[14] to check whether the browser is compatible (supports svg) and load the appropriate javascript files. The get..() functions are called through buttons on the html page and switch the selected visualization accordingly. Listing 1 Excerpt from visualization.js 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
var companyID ; var baseURL ; f u n c t i o n initVis (c , u ) { companyID = c ; baseURL = u ; Modernizr . load ({ test : Modernizr . svg , yep : // load required javascript files , nope : // show update browser prompt complete : f u n c t i o n () { if ( Modernizr . svg ) { initGraph () ; } } }) ; } f u n c t i o n getGeo () {...} f u n c t i o n getBusiness () {...} f u n c t i o n getStats () {...}
The actual creation of the visualisation is done in the generate-...-.js files (Listing 2). Here the required data is requested through the d3.json() function. In the callback of that function the svg elements and behaviours are generated and defined.
5 Implementation
14
Listing 2 Requesting the data and creating the main svg element (excerpt of generate-pie-chart.js) 1 2 3 4 5 6 7 8 9 10
d3 . json ( baseURL + " Ajax / Statistics / " + action , f u n c t i o n ( data ) { // Creating the main SVG element var vis = d3 . select ( " # chainmap " ). select ( div ) . append ( " svg : svg " ) . data ([ data ]) . attr ( " width " , w ) . attr ( " height " , h ) . append ( " svg : g " ); }
5.1.2
Retrieving the data - RPC framework
The RPC framework needed as explained in section 4.1.2, will work by sending a php function name and if needed its arguments in a special formatted url. E.g. http://server.tld/Ajax/ Class/function/arg1/val1/arg2/val2/../.. would retrieve the data for Class->function( arg1 = val1, arg2 = val2 ). Because of the intrinsic differences between an rpc request and the actual back-end code, some glue-code cannot be omitted. To make this mapping beween rpc call and php-code consistent, a (url)Router-class parses the url request given by the http-server, and loads the corresponding Ajax_Controller_${Class}-class (where ${Class} is the value parsed from the url) which handles any peculiarities (like object initialization and resource authorization, etc) needed to fulfill the actual function which it delegates to a newly created Ajax_${Class}-object. The wrapper classes Ajax_${Class} on their part retrieve the output of the wanted function from the intended {$Class} and make the output marshallable, in the form of a json encoded string, which will be sent as output to the front-end – ready to be parsed by javascript. A sequence diagram has been provided in Figure 5 to give a visual impression of a rpc-request. sd RPC Framework index.php
: Router
: Ajax_Controller
: GraphData
: Actor <<request>>
route() : Ajax_Controller getClass() : Ajax_Controller_AbstractController <>
: Ajax_Controller_GraphData
<> <> invoke() : void
createAjaxClass() : void
<>
: Ajax_GraphData
getGraphData() : void
<> json data
Fig. 5: Sequence diagram of rpc-request
getGraphData() : void <> array
5 Implementation
5.1.3
15
Integrate into CHAINels
The html pages of chainels are completely generated in php. The page where the visualization should appear is generated in the file Chain.php in the chainels source code. The file has a very straightforward structure, it has functions such as getJSFiles() and getCSSFiles(). In Visualization.php (Listing 3) we defined similar functions that are then called in Chain.php. Visualization.php also contains all the other functions that are used to generate necessary html elements for the visualizations. Listing 3 Excerpt of Visualization.php 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
class Visualisation { public static function getJSFiles ( $company , $account ) { $internal = empty ( $account ) ? false : ( $company - > getCID () === $account - > getCID ()); if ( $internal ) { return array ( dirname ( dirname ( __DIR__ ) ) . ’/ graph / javascript / modernizr . js ’ , dirname ( dirname ( __DIR__ ) ) . ’/ graph / javascript / visualization . js ’ ); } else { return array ( dirname ( dirname ( __DIR__ ) ) . ’/ graph / javascript / modernizr . js ’ , dirname ( dirname ( __DIR__ ) ) . ’/ graph / javascript / external - visualization . js ’ ); } } ... }
5.2 5.2.1
Recommendation Friend Recommendation
In Section 4.2.1 we’ve discussed how we will use bfs as a means to traverse the network graph. The scoring function is injected through means of closures (passing another function reference through parameters of a function-call), this keeps the readability of the original bfs and provides a simple mechanism to inject other scoring functions. This means we can use the same bfs implementation for the fcr as well as the bcr algorithm. Another part which is done through closures is the method used to retrieve the children from a particular node. The scoring function itself following Adamic/Adar’s method takes 2 arrays, computes the intersection of those and then retrieves the frequency and then returns a score using the amount of outgoing relations of those intersected nodes.This means that the scoring function retrieves 2 arrays of companies and provides a score based on the number of chains of a common company. The code for the scoring function can be found in listing 4. 5.2.2
Business Recommendation
Association Rules To calculate the association rules we simply take all the chains in the database and take the industries of those companies so only the rules "industry ⇒ industry" are left. Calculating the scores of the association rules is not done on every run. We create a 16x16 matrix (there are 16 industries defined) to easily look up the values. This matrix is saved in the Redis database row by row, meaning that for each row, there is a key value pair keyi ⇒ (key0 ⇒ value, .., keyi ⇒ value, keyi+1 ⇒ value, ..) (where keyi is the key of industry i). How the matrix is filled and then saved can be seen in listing 5. For now the function to fill this matrix
5 Implementation
16
Listing 4 Excerpt of BFSRecommender.php 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
// generic Adamic / Adar f u n c t i o n public static function a da mi c Ad ar Sc o re ( $array1 , $array2 , $frequency ) { $ co mm on E le me nt s = a rr a y_ in te r se ct ( $array1 , $array2 ); $score = 0; foreach ( $c o mm on El em e nt s as $comm onElemen t ) { // note that $ f r e q u e n c y is not a variable , but a r e f e r e n c e $score += 1 / log10 ( $frequency ( $common Element ) ); } return $score ; } // based upon domain - model public static function friendScore ( $cid1 , $cid2 ) { $array1 = Chains :: g e t C o m p a n y C h a i n s I D s ( $cid1 ); $array2 = Chains :: g e t C o m p a n y C h a i n s I D s ( $cid2 ); $frequency = function ( $cCompany ) { return Chains :: a mountOfC hains ( $cCompany ); }; return BFSRec ommender :: ad am ic A da rS co re ( $array1 , $array2 , $frequency ); }
(fillDiagram();) is called manually, in Section 9 we make recommendations on when to run this function. We implemented each of the four "interestingness" functions defined in Section 4.2.2, to be able to later test which function gives the best result. A method is selected by giving fillDiagram( $method ); a parameter "lift", "support", "conviction" or "confidence". Listing 5 shows the structure of this class. Supply and Demand The supply and demand score is in a separate class. We defined a very trivial function of comparing supply and demand. The matching is done by comparing the demand of one with the supply of the other one and the other way around. This is done by simply checking whether the two strings are equal. When the matching is done it divides the outcome by the size of the first company’s demand and supply, returning a relative score. Combining the two parts The association rules and the supply and demand score get combined within the class BFSRecommender in the function businessScore(). The function takes 4 arguments, the first two are company id’s and the last two are the weights (respectively industry weight and supply/demand weight). This method calculates both scores and multiplies them with the associated weight and returns this score. The scores are subsequently sorted on distance if they are equal. The score function is then injected into the bfs algorithm, as described in Section 5.2.1.
5.3
SIG Feedback
During the 2nd week of the implementation phase, the code code was submitted to the SIG, Software Improvement Group, which reviewed the code and provided feedback, resulting in some pointers to improve the current codebase. A translated version in English is provided here, followed by sections giving response to the critique points thereafter. The original feedback in Dutch can be found in Appendix F.
5 Implementation
Listing 5 Excerpt of AssociationDiagram.php 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
class A s s o c i a t i o n D i a g r a m { public static function fillDiagram ( $method = ’ lift ’ ) { switch ( $method ){...} } public static function f i l l W i t h C o n f i d e n c e () { foreach ( $industry as $industryFrom ) { // Create each row of matrix foreach ( $industry as $industryTo ) { $scores [ $industryTo ] = confidence ( $industryFrom , $industryTo ); } s e t I n d u s t r y Sc o r e ( $industryFrom , $scores ); // Save to d a t a b a s e } } public static function fillWithLift (){...} public static function f i l l W i t h C o n v i c t i o n (){...} public static function f il lW i th Su pp o rt (){...} }
The codebase scored about 4 stars on our maintainability model, which means that the code is above average. The highest score is not given, because of a lower score for Unit Size, Unit Interfacing and Component Balance. For Unit Size our [SiG] system looks at the percentage of code that is above average in length. The splitting of these methods in smaller chunks provides better maintainabilty, testibility and ease of understanding on first glance. The ’getBusinessGraphJSON’methode in ’BusinessGraphData’ is an example of a too long method. Try to look critically at the longer methods in the system and refactor if needed. For Unit Interfacing, we looked at the percentage of code in units with an above average amount of parameters. Usually that signifies a lack of abstraction. Besides that, a large number of parameters usually leads to confusion in the calling of a method and in the most cases it also leads to longer and complexer methods. What is apparent about this system is that inside the ’AssociationRules’-class’s methods a variable “firstIndustryKey” as well as “SecondIndustryKey” is signatured in most methods. It appears these variables form a concept of a Rule. To make it more clear for future code-edits and know what to provide to these methods, it might be worth considering introducing a new type/class for this concept. For component balance, there’s been looked at the amount of components and the distribution of the amount code over these components. Both aspects influence the ease of finding some particular functionality within the system. What is apparent in this system is that on the first level distinction has been made between technology type (’hml’, ’javascript’, php). Within the html directory however is no html-formatted code, but also php. There appears to be made some distinction in directories for Ajax functionality within the PHP-directory but it appears more distinction in functionality could be made. Especially if the system will grow, we recommend to distribute the code over more components. In general the code scores above average and the existence of test-code is promising at the least. Hopefully the amount of test code will also grow as new functionality is added.
17
5 Implementation
5.3.1
18
Unit Size
After review, the unit size of some functions/methods was indeed long. Methods like ’getBusinessGraphJSON’ in the class ’BusinessGraphData’ have been refactored and common functionality has been moved outside the method which later on can be re-used. 5.3.2
Unit Interfacing
SIG’s review concluded that the method signatures in the ’AssociationRules’-class were unneedingly repetitive among several methods. To be more specific, there were several methods with “firstIndustryKey” and “secondIndustryKey” as their parameters. Their critique is valid, however these methods are not used together. As explained in this report, we implemented four different methods of calculating the industry scores (support, confidence, lift and conviction), to test which one would yield the best result. Only one of these functions will actually be used. If this was a function in any other type-friendly programming language in contrast to php, a tuple would’ve been an appropriate parameter. But a complete class seems overkill for such a primitive type as 2 junctioned keys, which are just integers. 5.3.3
Component Balance
The main point of critique is our distribution of code over components, is grouped by technology i.e. directories for php, Javascript and html files. Hardly any ’namespacing’ was done, we did introduce functionality separation in the Ajax-rpc subsystem. As our added functionality written for this project, would be placed as an module within chainels’ source-tree, we initially used this directory layout as it was expected by chainels because it mimicked their current layout. It did feel as an arbitrarily chosen layout then and it does now, however before serious replacement of the components can be made, unit-tests and integral tests need to be up to par before beginning a high level refactoring, this includes the code of chainels. Recommendations for future development on the code base is handled in the separate dedicated section 9. We did however split up the code into two components "graph" (for the visualization part) and "recommendation" (for the recommendation part). In these two components we kept the same structure as described (grouped by technology).
6 Testing
6
19
Testing
6.1
Unit Tests
We defined unit tests for each class. The unit tests made use of a framework named SimpleTest[18], that was already used by chainels. These unit tests were used to validate that our recommendation algorithms worked to a certain level (calculating values) and that the classes who parsed data for the visualizations, parsed their data to json format correctly. That last part was necessary in the beginning, before we discovered a handy php function called json_encode(). We first tested the Javascript files for the visualization with no dependencies on the classes that provide the data. We did this by using a separate json file. 6.1.1
Mocking and Current Design Limitations
Due to the current database and software structure we couldn’t make use of mocks, because almost every class is tightly coupled wih the database and dependent on its results (e.g. auto incrementing keys). A concrete example is that a typical model class like Company currently has knowledge about the inner workings of the database (Redis) to save its own state to the database, this goes directly against the principle of Separation of Concerns [7] and thus makes testing more difficult than it needs to because making a simple mocked object resulted in the need of a complete database. Without having to rewrite the existing codebase, we solved this problem partially by making an array of companies. That set of companies is then saved on-the-fly into the Redis-database (needed due to the tight coupling in the codebase), using chainels’ existing functionality. Afterwards, certain chains were created among these companies to represent some (sub)sets of connected companies. These steps are the bootstrap we needed to do before actually starting with any test. When the test class ends it calls a function that removes the companies, which were previously added – from the database, to keep a clean slate for next tests by guaranteeing database state. The common set of these ’mocked’ companies introduced another problem, because individual team members occasionally introduced new data to it – to test their own individual test classes’ edge cases – which could make unit tests of other classes fail due to their dependence on the set’s state. This problem was quickly noticed and we searched for a way to solve this easily without investing to much time in it. The way it was solved was done by making an external file containing those same companies in a special but straightforward format. The benefit of the external file is that it can be easily adjusted (adding companies, adding relations). In the unit tests we first call the external file and then create the desired relations (chains) set by selecting a relations set within the external file. Listing 6 shows an excerpt from the external file. The external-file solution, is not ideal, but was necessary to circumvent chainels current design limitations. Recommendations to better the existing codebase’s current design will be provided in Section 9.
6 Testing
20
Listing 6 Excerpt from the external file [0] aboutText = " B ac h el or pr o je ct !" i sI nt er n at io na l = true busi nessnumb er [] = ’51722909 ’ busi nessnumb er [] = ’NL ’ mission = ’ Eenvoudige software maken voor iedereen ’ comp anySizeK ey = ’2 -5 ’ founded = ’2012 ’ basicSupply [] = ’ Screens ’ basicSupply [] = ’ Servers ’ basicSupply [] = ’ Books ’ basicDemand [] = ’ Computer ’ basicDemand [] = ’ Webshop ’ basicDemand [] = ’ Software ’ streetname = ’ Mekelweg ’ streetnumber = ’4 ’ zipcode [] = ’2628 CD ’ zipcode [] = ’NL ’ state = ’ Zuid - Holland ’ city = ’ Delft ’ name = ’ Network Visualisation Project ’ industryKey = ’in ’ industryLabel = ’ Software en Websites ’ countryKey = ’NL ’ rel [] = "0 -3 -3 ,1 -2 -3 ,3 -1 -3 ,5 -0 -3 ,9 -0 -3 ,8 -0 -3 ,7 -8 -3 , 7 -6 -3 ,9 -6 -3 ,4 -2 -3 ,1 -10 -3 ,1 -0 -3 ,11 -2 -3"
6.2
Integration Tests
Once the unit tests showed that the data classes for the visualization, and the RPC framework worked properly, we integrated the back-end into the front-end (the javascript files). Since then the visualization was tested mostly by user tests (essentially making each user test an integration test). The recommendation algorithm was tested to see whether it performed on accuracy and running time. The results of those tests can be seen in Section 7.
6.3
Acceptance/User Tests
Since the visualization is primarily behaviour based, user testing was key. We user tested after we implemented each new feature, using the Javascript Console available in most browsers, to check for errors. User tests were done also by a few members of chainels itself, this came in handy to see if it worked according to their expectations/needs. After each iteration we got feedback on the application from the chainels team, that we discussed and processed.
7 Recommendation Performance Evaluation
7 7.1
21
Recommendation Performance Evaluation Creating a Test Set
We quickly realized that in order to test our recommendation algorithm, we would need a realistic data set, with meaningful relations between companies, real addresses, real industries and matching supply and demand. The problem was this data set didn’t exist. We wanted to use the old database, this old database is from the beta phase of chainels and is filled with companies that signed up in that phase. Unfortunately this data set is inappropriate, as many companies simply created chains to try the system out, which meant the relations in this data set were not of any use for us. This is why we decided to create our own data/test set. In total we created 41 companies, companies that we found on the internet, defined supply and demand for each of them and then made chains between them according to supply and demand, just like a real company would probably do. Although the data set is not "real" it is an as realistic as possible data set we could have. Figure 6 shows the test set visualized in a graph. 35 26 37 25 39 31 30 5
41 14 8
34
16
29
40 28
27
12 32
4
3
23
9
13 20 2 1
38 6
17
7
36
18
11 33
24 15 22 19
10 21
Fig. 6: Visualization of the test set
7.2
Testing our Algorithms
The fcr algorithm uses Adamic/Adar to provide scores. As this method has already been researched and tested extensively [24, 11], testing this algorithm on its accuracy wasn’t a priority for us. For our bcr algorithm we used our own defined test set. For each company whose recommendations are calculated, we first randomly remove a chain, then run the recommendation algorithm, and then check if the top five computed recommendations contains the removed chain. We did this 100 times to calculate the average results. Algorithm 2 shows the test algorithm.
7 Recommendation Performance Evaluation
22
Algorithm 2 Testing the business recommendation algorithm 1 loadDataIntoDatabase ( ) ; 2 f o r a l l i where 0 <= i <= 100 3 f o r a l l Companies c i n Database 4 removed = c . removeRandomChain ( ) ; 5 r e c s = c . getRecommendedBusinessChains ( ) ; 6 i f r e c s . c o n t a i n s ( removed ) 7 s c o r e ++; 8 end i f 9 end f o r 10 s c o r e = s c o r e / numberOfCompanies ; 11 t o t a l S c o r e += s c o r e ; 12 end f o r 13 a v e r a g e S c o r e = t o t a l S c o r e / 1 0 0 ;
The bcr algorithm has a number of parameters: • Association Rule calculation method (lift, confidence, conviction, or support) • The weights gi and gsd
To decide which parameter values to choose, we defined test cases that will be discussed in the next section.
7.3
Accuracy Results
Figure 7 and Table 1 show the accuracy results of the bcr algorithm test cases defined in the previous subsection (7.2).
Lift Confidence Conviction Support
gi = 0.7, gsd = 0.3 0.8065853 0.6656097 0.5917073 0.4139024
gi = 0.6, gsd = 0.4 0.7902439 0.7258536 0.67 0.4075609
gi = 0.5, gsd = 0.5 0.7821951 0.7670731 0.7143902 0.4146341
gi = 0.4, gsd = 0.6 0.7741463 0.7648780 0.7358536 0.4182926
Tab. 1: Accuracy results of BCR Algorithm.
gi = 0.3, gsd = 0.7 0.8039024 0.7680487 0.7451219 0.4736585
7 Recommendation Performance Evaluation
23
ϵϬ͕ϬϬй ϴϬ͕ϬϬй ϳϬ͕ϬϬй ϲϬ͕ϬϬй
>ŝĨƚ
ϱϬ͕ϬϬй
ŽŶĨŝĚĞŶĐĞ
ϰϬ͕ϬϬй ϯϬ͕ϬϬй
ŽŶǀŝĐƚŝŽŶ
ϮϬ͕ϬϬй
^ƵƉƉŽƌƚ
ϭϬ͕ϬϬй Ϭ͕ϬϬй
ŐŝсϬ͕ϳ͖ŐƐĚсͲϬ͕ϯ ŐŝсϬ͕ϲ͖ŐƐĚс Ϭ͕ϰ
ŐŝсŐƐĚсϬ͕ϱ
ŐŝсϬ͕ϰ͖ŐƐĚс Ϭ͕ϲ
ŐŝсϬ͕ϯ͖ŐƐĚс Ϭ͕ϳ
Fig. 7: Accuracy results of BCR Algorithm. Comparing our bcr algorithm with the old recommendation algorithm, that can be seen in figure Figure 8, we see that the bcr algorithm gives a much higher accuracy.
90 80
Accuracy (%)
70 60 50 40 30 20 10 0 BCR
Old Recommender
Fig. 8: Accuracy of BCR and old reccomenders.
7.4
Running Time Results
Figure 9 shows a box plot with the running times of the bcr (with lift calculation method) and the fcr algorithm, next to the old recommendation algorithm. The difference between the fcr and the old algorithm is minimal (10−3 ). The bcr algorithm obviously takes longer, as it covers more nodes for each run.
7 Recommendation Performance Evaluation
24
0.055 0.05 0.045 0.04
time (s)
0.035 0.03 0.025 0.02 0.015 0.01 0.005 BCR
FCR Recommender
Old
Fig. 9: Running times (in seconds) of each of the recommenders.
Table 2 shows the running times for each of the association rule calculation methods. The differences are minimal, confidence has the longest running time, conviction is the fastest. Association rules Lift Confidence Conviction Support
Time (seconds) 0.034100090146065 0.034936551690102 0.032531196653843 0.033119720101357
Tab. 2: Running time for each association rule calculation method
7.5
Conclusions
Our test results show a significant improvement in accuracy of the recommendations of up to 70%. We have to keep in mind that the test set is an optimistic one, with perfect matching supply and demand labels, and chains created by us. Though even with taking this into account, it is clear that the bcr algorithm is able to give relevant business recommendations. Figure 7 also shows us that the lift calculation method consistently provides the best accuracy. We therefore chose this calculation method, with the weights gi = 0.3 and gsd = 0.7 as our default parameters for the bcr algorithm.
8 Requirements Evaluation
8
25
Requirements Evaluation
In this section we will evaluate the requirements we defined prior to implementing and check if we implemented all the requirements and if not, why. The requirements can be found in Section 3 (or in Appendix E), we will refer to the requirements using the same numbers as in that section.
8.1
Checklist
Requirement 1 (geographical) 2 (full screen) 3 (switching) 4 (clicking on nodes) 5 (colored chain types) 6 (older browser) 7 (different visualizations) 8 (different views)
9 (simple graph) 10 (time slider) 11 (zoom) 12 (filter chain type) 13 (recommended chains) 14 (stats distribution of industries) 15 (stats - chains over time) 16 (user/visitor) 17 (marked common chains)
Implemented?
! ! ! ! ! ! ! !/# ! ! ! ! ! ! ! ! #
Comment
Mouseover instead of click
When defining the requirements we didn’t have clear idea yet on how the visualization would look. During implementation we found that different views didn’t add anything to the geographical view. We did however implement this for the business chain graph, it has two views: based on industry and based on chain type.
We found that adding an extra marking for shared companies on the geographical map, would make the visualization appear very crowded, so we decided against it.
8 Requirements Evaluation
Requirement 18 (rec algo. features)
26
Implemented?
!/#
19 (private) 20 (improve algorithm) 21 (user-interface) 22 (lots of users)
! ! ! !
23 (lots of data in graph) 24 (compatibility)
! !
Comment Chaintype isn’t taken into account, since there is no need for it. The algorithm looks at company properties, not of the chains. Competition isn’t taken directly into account, however implicitly it would not be a big problem. Supply and demand would score very low, and "friend" companies would be doing business with you sooner than your competitor, which means your friends aren’t likely to have chains with your competitor. So both algorithms would not recommend a direct competitor. See Section 7.
The visualization runs client side, so server side it isn’t an issue. The recommenders run relatively fast, but further optimization is required (see Section 9)
IE9 does not show the tooltips on the SVG elements (the businesscard) this is a bug in tipsy (the jQuery plugin we use for the tooltips), that we haven’t been able to fix.
9 Recommendations for Future Work
9
27
Recommendations for Future Work
In the weeks during the project, we gained insights about the system which were not explicit or even known at all at the beginning of the project. Some of these insights, gained knowledge, and ideas which can be put to use in future projects by chainels will be formalized in this section.
9.1
System Extensions
In the period working on this project, some additional functionality was considered as would like to have in the near future. The main features which are likely to be helpful on a short-term basis are: 1. Export statistics as pdf - Based upon (end-)user tests performed by the Chainels member Buckers and some early adopting users, users would like to be able to save the statistics for offline and/or (e-)paper use. Exporting the statistics to pdf-format seems the most obvious choice . 2. Reasoning for recommendation - Based upon the user testing, users would like to see why a certain recommendation has been provided to them. A concise overview of supply and demand fulfillments (and possibly other future algorithm parameters) should be provided when a user gets a recommendation. 3. Declining recommendations - As currently implemented, recommendations are considered ’right’ in the sence that an end-user cannot tell the system that the recommendation is actually wrong. An end-user should be able to mark an recommendation as faulty. This has two advantages: 1) the user’s interface is not cluttered with non-interesting recommendations and 2) the system can learn from these faulty recommendations, although a suitable algorithm would’ve to been found or created first to make use of this extra information to further increase the accuracy of the recommendations.
9.2
Optimization
During the implementation period some unforeseen complexities arose, the 2 prominent main problems are in the Bread-First-Search functionality and in the supply and demand subsystem. 9.2.1
Breadth-First-Search
While testing the recommendation algorithms we stumbled upon a problem. This problem was that in our testset two subsets were formed. Between the two subsets there was no link, which means that bfs won’t recommend companies from the other subset. There might have been a good recommendation within the other subset, but these couldn’t be found. This problem often shows up when someone is new to the network, meaning he doesn’t have chains yet. We have solved this, though the problem remains, by making recommendations based on locations. This means that the nearest companies get recommended. As this problem can also happen with users who do have chains, we offer a solution: instead of using BFS to search for companies within the network by using an other method called Random Walk[15]. Random Walk has many variants and the benefit of using this algorithm is that it can reach other companies that aren’t chained and thus improving the recommendation. 9.2.2
Labels/Synonyms
The current implementation for checking the score between 2 companies based upon their supply and demand is fairly trivial and from a high-level perspective checks for matching supply and demand entries by iterating over them word by word.
9 Recommendations for Future Work
28
The problem is that multiple companies can have different words for the same item/concept, e.g. car and vehicle can often be interchanged in the same context, yet the current implementation would score these as a low match resulting in missed opportunities for the respective car dealer. Two possible solutions came to mind during brainstorm sessions. 1) Using thesauri databases[23] we can deduce the meanings of words and retrieve a set of similar meaning words, before matching them against another company’s supply and demand. 2) Another option is to take away the completely free input from users when filling in their supply and demand and introduce a label/tagging system, in which companies should fill in pre-existing labels which fit and describe their item best. This would make the scoring function better because in essence the problem of finding words with similar meaning is being placed on the companies who themselves know best how to describe their item, with the labels provided. In the example above the only label available would be ’car’ and other synonyms like ’vehicle’ would be omitted from the possibilities. 9.2.3
When to run the Algorithm
As Section 5.2.2 explains, the storing of the industry scores is done by saving a matrix in the database. Now, the function to fill this matrix is called manually. This is a far from ideal solution, so there are two ways to handle this. Run the function after a certain amount of time, or after x amount of chains have been created. The second option would be best, since the amount of chains chained since the last run is always the same.
9.3
Recommendations for CHAINels
During the project some workflow inefficiencies surfaced as well as doubts about some parts of the current software design with regard to cleanliness and component balance (which was already pointed out by the SiG in Subsection 5.3.3). In the following sections each point of critique will be handled. The first 3 following sections point out workflow improvements, especially applicable when the developers team grows. The 2 following sections after that apply to the codebase itself. 9.3.1
Code Reposititory Branches
The current state of the repository is a svn server with one branch, in which everybody checks in their changes. In essence this is not totally wrong, but lacks some best practices. Often multiple branches are provided for different teams or even for individual persons. In our case we were sometimes ’fighting’ against another developer team which made changes in the same files we did. Although code repositories are partially invented for the “source conflict”-problem, the amount of conflicts which have surfaced on trivial commits could have been mitigated by the use of branches, which would have been merged on specific times, in which conflicts could have been resolved in a more efficient way. 9.3.2
Check in/push to a webserver
In addition to the use of branches, another best practice can be applied, often in software projects the trunk-branch is directly coupled to a webserver, so the code can be tested once it’s been committed. The problem is that currently the workflow demands that you commit (or don’t) to the svn server and then upload your files to the webserver. This introduces a coherency problem, you don’t know if your own team has uploaded everything up till revision ’n’ let alone another team’s upload history (or even worse, if they uploaded in the time between your upload and the actual testing). In essence the problem is that you cannot guarantee the state of the uploaded code - while using a code-repository coupled with a webserver would easily ease this burden. 9.3.3
Test Locally
To mitigate the webserver-state problem even more without forcing your team to use the same webserver – which introduces delays while another team-member is testing and using the webserver
9 Recommendations for Future Work
29
– is by testing on the local developer machine. This introduces some extra requirements on the developer’s machine, but pays off in the end by having the ability to test one’s code at any moment. Another positive side effect is that the code will be more resilient against heterogeneous hardware and software, improving portability of the software to different systems which might be worth the extra effort when multiple servers are needed when scaling out in the future. 9.3.4
Reduce Tight Coupled Subsystems
At this moment there is not really a separation of concerns in the software design. More concrete examples are the use of typical model classes like Company and Chains. These model classes have to much knowledge of the back-end system and are interacting directly with the database for example. A simple observation of the vast amount of static functions in each of these classes indicates that these classes are merely a one-dimensional namespace for common functionality and as such is perpendicular to proper object oriented design. The separation of the functionality would also improve the testability for these classes, which is very hard at the moment because even the most simple unit-tests now depend on a complete working Redis-server. 9.3.5
“Premature optimization is the root of all evil”
The title of this subsection is a quote of renowned computer scientist Donald Knuth[9](pp. 261–301, §1). Knuth is not trying to discourage efficient code for the sake of elegant code, but implies that optimization should not be an aim of a project. In the case of the chainels codebase, there are some questionable choices which are justified for the sake of optimized code. As a concrete example is the import subsystem which handles the just-in-time availability for needed data-structures/classes. Many methods are cluttered with multiple import statements. This is not needed and the php-language has some very elegant ways to do this already. The lack of any tests and profiling for justification for this self-invented import system are missing, but the code is cluttered nonetheless, not even speaking of the double bookkeeping of having every class name in an array of constants which further reduces ease of re-factorization.
10 Conclusion
10
30
Conclusion
The goal of this project was to create a network visualization for a company network, and to design a new chain recommendation algorithm. We defined a recommendation algorithm that performs significantly better then the old recommendation system currently used by chainels. It not only recommends companies based on "friend" relationships, but also recommends new business opportunities by looking at supply and demand, and industries. The network visualization provides new ways for companies to view and analyze their business networks. The people of chainels have shown our work to companies during meetings, and the response has been overwhelmingly positive. It exceeded the expectations of chainels, who at first didn’t really know how it would turn out, and just thought about it as an extra feature, but they now consider it a key feature of chainels. For us, the project has been a great learning experience. We learned much about the field of recommendation algorithms, and also got more experienced using programming languages as php and JavaScript. In the end, we can look back on a successful project, and we are happy to see our work used on chainels. We wish them the best in the future! Hans van den Bogert Christiaan Titos Bolivar André Vollering
11 Project Log
11
31
Project Log
11.1
Process
In our plan of approach document (Appendix C) we specified the way we approached this project. During our orientation phase we primarily focused on the visualization aspect of the project, we were all novices at recommendation algorithms so we had to start from scratch in learning the field. The orientation on the recommendation algorithms therefore bleeded into the design phase. Late in the design phase we were assigned our supervisor. This was already after we finalized the rad document. We think had we gotten a supervisor earlier on, we would have been able to make a better rad document. It was also unfortunate that we made the plan of approach and orientation report in Dutch, since we didn’t know who our supervisor was going to be. After writing our rad document the implementing phase started. We chose scrum as software development method. In hindsight this wasn’t the best method we could choose. We quickly found that we were sitting right next to each other every day, defining and assigning tasks was done just by talking to that person. In the end our development method could better be described as "scrum-like" or just a generic agile method. At the start of each iteration we discussed and wrote down our progress so far, what we did last iteration and what will be done this iteration. We then added those tasks into our tracker system. For this issue tracker system we used Trac [19]. But due to some unfortunate events, the server where trac ran on accidentally got removed during iteration 3 and we lost all our data. After that we decided to just use Google Docs, since the issue tracker didn’t really fit into our workflow anyway. Again, the wrong decision to use scrum could have been prevented had we had our supervisor earlier. During our fist meeting he was quick to notice that scrum probably was not going to work. Our eventual way of working was based on his advice. We divided the implementation amongst us as follows: Hans Friend Recommendations, rpc Framework Christiaan Visualization (Geo and Stats), front-end Andre Business Recommendations, Visualization (Business)
11.2
Logs
Iteration 1: What will be done • Visualization – Geographical ∗ Setting up d3 and polymaps. ∗ Insert map on chainels page. • Recommender – Create the required classes for the algorithms. Iteration 2: What will be done • Visualization – Geographical graph ∗ Chain filters
11 Project Log
∗ ∗ ∗ ∗
32
Industry colors Busininess card popup bing maps Toggle Full screen
• Recommender – Write and Unit test the algorithms Iteration 3: What will be done • Visualization – Adding three buttons to switch visualization – Geografical graph ∗ Internal/external ∗ Research/Investigate clustering ∗ Time slider
– Business chain graph – Statistics
∗ Begin implementing pie-charts
– Implement rpc framework for efficient and secure data requests • Recommendation – consensus about when and how to invoke the algorithm(s) – how to store scores for efficiency/cache. Iteration 4: What will be done • Visualization – Switch geo and business chain over to new json-rpc framework – Finish up styling – Geographical ∗ Only show industries that you have chains with in legend ∗ In GraphData use json_encode() instead of hand coding the JSON
– Business chains
∗ time slider functionality ∗ In BusinessGraphData use json_encode() instead of hand coding the JSON
– Statistics
∗ number of chains over time -> line graph ∗ pie chart with type of chains • Recommendation
11 Project Log
– consensus about when and how to invoke the algorithm(s) – business recommendation not by breadth first search? -> after testing – how to store scores for efficiency/cache. – integrate it into chainels – testing! Iteration 5: What will be done this iteration • processing SIG feedback • Visualization
– fullscreen ∗ switching business chain vis should remain full screen ∗ external full screen
– finish up styling
∗ business card -> willem
– IE9 bugs
• Recommendation – testing – storing association rules in database – creating test set
33
11 Project Log
34
References [1] L.A. Adamic and E. Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003. [2] Bing Maps. Bing Maps APIs, SDK, and application development tools. URL http://www. microsoft.com/maps/. [3] Changhao Jiang, Facebook . BigPipe: Pipelining web pages for high performance. URL http://www.facebook.com/note.php?note_id=389414033919. [4] d3.js. D3.js - Data Driven Documents. URL http://d3js.org/. [5] Facebook. Welcome to Facebook - Log In, Sign Up or Learn More. URL http://www. facebook.com/. [6] Graphviz. Graphviz | Graphviz Visualization Software. URL http://www.graphviz.org. [7] W.L. Hürsch and C.V. Lopes. Separation of concerns. 1995. [8] jQuery. jQuery: The Write Less, Do More, JavaScript Library. URL http://jquery.com/. [9] D.E. Knuth. Structured programming with go to statements. ACM Computing Surveys (CSUR), 6(4):261–301, 1974. [10] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007. [11] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03, pages 556–559, New York, NY, USA, 2003. ACM. ISBN 1-58113-723-0. doi: 10.1145/956863.956972. URL http://doi.acm.org/10.1145/956863.956972. [12] LinkedIn. World’s Largest Professional Network | LinkedIn. URL http://www.linkedin. com/. [13] Michael Hahsler. A Comparison of Commonly Used Interest Measures for Association Rules . URL http://michael.hahsler.net/research/association_rules/measures.html. [14] Modernizr. http://modernizr.com/. URL http://modernizr.com/. [15] K. Pearson. The problem of the random walk. Nature, 72(1865):294, 1905. [16] PolyMaps. PolyMaps. URL http://polymaps.org/. [17] Redis. Redis. URL http://redis.io/. [18] SimpleTest. SimpleTest - Unit Testing for PHP. URL http://simpletest.org/. [19] Trac. The Trac Project. URL http://trac.edgewall.org/. [20] Wikipedia. Ajax (programming) - Wikipedia, the free encyclopedia, . URL http://en. wikipedia.org/wiki/Ajax_%28programming%29. [21] Wikipedia. NoSQL - Wikipedia, the free encyclopedia, . URL http://en.wikipedia.org/ wiki/NoSQL. [22] Wikipedia. Six degrees of separation, . degrees_of_separation.
URL http://en.wikipedia.org/wiki/Six_
[23] Wikipedia. Thesaurus Wikipedia, the free encyclopedia, . URL http://en.wikipedia.org/ wiki/Thesaurus. [24] Evan Wei Xiang. A survey on link prediction models for social network data. Science And Technology, 2008. URL http://ihome.ust.hk/~wxiang/Tutorial/LinkPrediction.pdf.
11 Project Log
35
List of Figures 1 2 3 4 5 6 7 8 9 10 11 12 13
Project planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mockup of the User interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Class Diagram for the Recommendation component . . . . . . . . . . . . . . . . . Class diagram for a part of the RPC framework (not every method of GraphData is shown) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sequence diagram of rpc-request . . . . . . . . . . . . . . . . . . . . . . . . . . . . Visualization of the test set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Accuracy results of BCR Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . Accuracy of BCR and old reccomenders. . . . . . . . . . . . . . . . . . . . . . . . Running times (in seconds) of each of the recommenders. . . . . . . . . . . . . . . Geographical visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Business chain visualization (grouped on chain type) . . . . . . . . . . . . . . . . . Business chain visualization (grouped on industry) . . . . . . . . . . . . . . . . . . Statistics visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 6 11 12 14 21 23 23 24 36 37 37 38
List of Tables 1 2
Accuracy results of BCR Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . Running time for each association rule calculation method . . . . . . . . . . . . . .
22 24
List of Listings 1 2 3 4 5 6
Excerpt from visualization.js . . . . . . . . . . . . . . . . . . . . . . Requesting the data and creating the main svg element (excerpt of chart.js) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Excerpt of Visualization.php . . . . . . . . . . . . . . . . . . . . . . Excerpt of BFSRecommender.php . . . . . . . . . . . . . . . . . . . Excerpt of AssociationDiagram.php . . . . . . . . . . . . . . . . . . . Excerpt from the external file . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . generate-pie. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13 14 15 16 17 20
Friend Chain Recommendation (fcr) algorithm . . . . . . . . . . . . . . . . . . . . Testing the business recommendation algorithm . . . . . . . . . . . . . . . . . . . .
8 22
List of Algorithms 1 2
A Screenshots
A
36
Screenshots
Fig. 10: Geographical visualization
A Screenshots
37
Fig. 11: Business chain visualization (grouped on chain type)
Fig. 12: Business chain visualization (grouped on industry)
A Screenshots
38
Fig. 13: Statistics visualization
B Problem Description
B
Problem Description
39
Bachelorproject CHAINels – Netwerk visualisatie Groepje -
Hans van de Bogert (1307983) Christiaan Titos Bolivar (4009010) André Vollering (1516485)
Over CHAINels “Uiterst kort is CHAINels te beschrijven als ‘de Facebook voor bedrijven’. In meer detail is het een professionele netwerktool die bedrijven en organisaties met andere bedrijven en organisaties verbindt (chains). Chainels kan worden gebruikt om zakelijk contact te houden met relevante relaties. Het ‘keyconcept’ is dat het puur op bedrijven gefocust is; het geheel moet dus een hoog professionaliteitsgehalte hebben. Dit is dan ook de onderscheiding tussen Chainels en al bestaande netwerken als Facebook, Twitter, LinkedIn, enzovoorts: het is bedoeld voor pure bussiness-to-business networking. Met dit oogpunt is de naam Chainels ook te verklaren: het belangrijkste voor een bedrijf is zijn ‘chain’ en zijn ‘channels’. Bedrijven houden via Chainels professioneel contact met hun relaties en kunnen vraag en aanbod met betrekking tot diensten en producten onbeperkt ‘uploaden’ en meer leren over bedrijven waarmee ze zaken doen.” Het plan is dat CHAINels binnenkort online gaat met een (nieuwe) beta versie. Er wordt klein begonnen met lokale ondernemingen, maar de ambitie is om uit te groeien tot een wereldwijd netwerk met miljoenen gebruikers. Om die reden moet dus de software die nu gemaakt worden ook klaar zijn voor de toekomst.
De Opdracht
Figuur 1 Het netwerk van een bedrijf
Op een van de drie pagina’s van een bedrijf/profiel op CHAINels wordt het netwerk van dat bedrijf getoond. Dat wordt gedaan op drie manieren: tegels van de bedrijven in je netwerk, een businesscard (een ingezoomde tegel) en de netwerkvisualisatie op een kaart (zie figuur 1). Wij zullen ons bezig houden met die netwerkvisualisatie. Het idee is om een graaf op de kaart te projecteren. Die graaf moet op een duidelijke manier het volledige bedrijfsnetwerk van de gebruiker weergeven. De vraag is, wat is duidelijk en hoe kun je het nuttig houden. Het is makkelijk voor te stellen dat bij een bedrijf dat mogelijk honderden chains heeft, het al snel een rommel wordt. Het is dus noodzakelijk chains te groeperen. Dit zal moeten gebeuren op locatie, maar ook op bijvoorbeeld industrie. Ook kan bijvoorbeeld een bedrijf wat heel veel chains heeft “groter” worden gemaakt op de graaf om aan te geven dat het een groot bedrijf is. Al deze mogelijkheden moeten worden bepaald en tegen elkaar worden afgewogen om tot de beste visualisatie voor een gebruiker te komen. Belangrijk is dus dat de visualisatie duidelijk moet zijn, bij zowel een grote hoeveelheid informatie, als bij een kleine hoeveelheid. Ook kan in de graaf mogelijkheden worden getoond voor nieuwe connecties (recommended chains). Hiervoor moet een recommendation-algoritme worden ontwikkeld aangezien het huidige algoritme erg simplistisch is en niet voldoet. De vraag hierbij is dus, wat is een relevante chain-aanbeveling voor een bedrijf? Bedrijven in de omgeving? Van dezelfde industrie? Van mogelijke leveranciers? Hier is geen makkelijk antwoord op te geven. Bij een visualisatie is het ook van belang dat het interactief is. Anders is het maar een plaatje. Er moet dus kunnen worden ingezoomd, kunnen worden geklikt op een node/bedrijf (waar dan bijvoorbeeld de businesscard getoond kan worden), en de hierboven besproken clusters/groepen moeten kunnen worden in- en uitgeklapt. Het is dus duidelijk dat er veel rekenwerk gaat zitten in een netwerkvisualisatie. Dit moet uiteraard ook zo efficiënt mogelijk worden geregeld. CHAINels is geprogrammeerd in object-georienteerd PHP met een NoSQL database (Redis). Hiermee zullen wij dus ook werken.
C Plan of Approach
C
Plan of Approach
42
Plan van aanpak CHAINels - Netwerkvisualisatie Hans van den Bogert - Christiaan Titos Bolivar - André Vollering
Voorwoord Dit is het plan van aanpak van het Bachelorproject voor CHAINels. In dit document wordt uitgelegd hoe het project zal worden aangepakt. Hierin zal de planning, de eisen en de opdracht kort en zo duidelijk mogelijk worden besproken.
Samenvatting Het project gaat over de opdracht Netwerkvisualisatie, waarbij verwacht wordt een feature te maken die het netwerk op een mooie manier zal visualiseren. Voor dit project worden 10 weken gebruikt en er zal gezorgd worden dat de code klaar is voor 10 juni (tijdens Fase III), want dan zal het voor controle opgestuurd worden naar het SIG (Software Improvement Group).
1. Introductie Dit project is gestart als aanleiding van het bachelor eindproject. De groep is gevormd door contact via de bijbehorende course op de blackboard site. Na een gesprek bij twee mede-oprichters van het bedrijf CHAINels, Erwin Buckers en Vincent Koeman, waren de interesses gewekt en hebben we besloten om hier een project te volgen. Na overleg is besloten om voor de opdracht Netwerkvisualisatie te gaan, waarbij het de bedoeling is om het netwerk te visualiseren door middel van een graaf. Een netwerk kan gezien worden als een web van bedrijven die gelinkt zijn door middel van “chains” (een connectie tussen twee bedrijven). Na een gesprek met het hele team van CHAINels, hebben we onze opdracht door de coördinator laten goedkeuren. Dit verslag zullen we laten controleren/goedkeuren door onze begeleider die in deze week aan ons zal worden toegewezen, en aan de hand van de gegeven feedback zullen we deze feedback in het verslag implementeren en verder gaan met de volgende stappen in het project. Dit document is als volgt opgebouwd. Eerst wordt de opdracht beschreven, waarbij alle eisen en verwachtingen worden weergegeven. Hierna zullen we een fasering en tijdsplanning van het project definiëren. Dan komt de projectinrichting, dit zal gaan over de organisatie, de rapportering enzovoorts. Als laatste zal beschreven worden wat het plan is om de kwaliteit van dit project te waarborgen.
2. Projectopdracht
2.1 Projectomgeving CHAINels is een website die op moment van schrijven nog in ontwikkeling is. Er is een oude beta versie momenteel online en binnenkort zal een nieuwe open beta worden gelanceerd. Op een van de drie pagina’s van een bedrijf/profiel op CHAINels wordt het netwerk van dat bedrijf getoond. Dat wordt gedaan op drie manieren: tegels van de bedrijven in je netwerk, een businesscard (een ingezoomde tegel) en de netwerkvisualisatie op een kaart dmv een graaf. Wij zullen ons bezig houden met die netwerkvisualisatie.
2.2 Doelstelling project De doelstelling van het project is het creëren van een visualisatie van een bedrijfsnetwerk. Voor intern gebruik moet dit inzicht bieden in het netwerk zelf. Bijvoorbeeld het bereik van je netwerk (geografisch of per industrie), de groei van het netwerk, maar ook suggesties bieden voor nieuwe chains. Extern kunnen andere bedrijven in één opslag zien hoe het netwerk van een ander bedrijf eruit ziet.
2.3 Opdrachtformulering Het visualiseren van een bedrijfsnetwerk door middel van een graaf, dat op een duidelijke en heldere manier inzicht geeft in het netwerk, en daarbij ook suggesties kan geven en tonen voor uitbreiding van dat netwerk (chain-suggesties).
2.4 Op te leveren producten en diensten Het eindproduct is een netwerkvisualisatie en analyse tool dat is geïntegreerd in de CHAINels website.
2.5 Eisen en beperkingen Globaal zijn de eisen: ● Geeft visualisatie van het netwerk van een bedrijf. ● Geeft suggesties voor nieuwe chains. ● Biedt de mogelijkheid tot analyse van het bedrijfsnetwerk (intern). ● Houdt rekening met de gegevens die een bedrijf wel en niet wilt delen (privacy) ● Performance moet zijn geoptimaliseerd.
2.6 Voorwaarden Voor een goede chain-suggestie is meer informatie nodig over een chain dan dat nu beschikbaar is. Het is nu in het systeem dat er wel of geen chain is tussen twee bedrijven. Het is nodig dat er een kwantificering van chains wordt geimplementeerd, dus wat voor soort relatie hebben die twee bedrijven. Of het development team van CHAINels moet dit doen, of wij kunnen dit (na overleg) in de opdracht betrekken. Het development team heeft dit geimplementeerd nadat we dit probleem met hun besproken hebben. De chains hebben nu een type en een tijd van aanmaak.
3. Aanpak en tijdsplanning De opdracht zal in 4 fasen verdeeld worden. Elke week is er een meeting met CHAINels waarin de voorafgaande week qua werkzaamheden en resultaten tot zover worden
doorgenomen. Ook zal elke twee weken een voortgangsrapport worden ingevuld.
3.1 Fase I - Oriëntatie (Week 1 & 2) Tijdens de oriëntatie fase wordt er gekeken naar bestaande technieken die gebruikt kunnen worden om een graaf efficiënt maar ook duidelijk te kunnen afbeelden, rekening houdende met het gegeven dat de kanten vanuit een knoop meerdere orden van grootte kunnen verschillen, vergeleken met een andere knoop uit de netwerkvisualisatie. Ook zullen we het huidige systeem van CHAINels bestuderen. Deze fase zal worden afgesloten met een oriëntatieverslag.
3.2 Fase II - Ontwerp (Week 3 & 4) Na een oriëntatie van het hoofdprobleem volgt een requirements analyse. In deze analyse zal er gekeken worden welke functionaliteit reeds bestaat en welke functionaliteit beschreven moet worden aan de hand van use-case beschrijvingen.
3.3 Fase III - Implementatie & Testen (Week 5 - 8) Tijdens de implementatie wordt de functionaliteit geboden in de vorm van programma code, rekening houdende met de bevindingen en ontwerp keuzes uit de voorgaande fasen. Om de correctheid te kunnen waarborgen zullen er ook tests op de code worden uitgevoerd. Een code analyse door de SiG wordt aan het einde van deze fase uitgevoerd.
3.4 Fase IV - Afronding (Week 9 & 10) In deze fase wordt de kritiek op de code van de SiG verwerkt. Naast de code afronding wordt ook het projectverslag afgerond en een eindpresentatie worden voorbereid.
4. Projectinrichting 4.1 Organisatie De organisatie rond het project is als volgt: ● Dr.Ir. S.I. Suddle, CHAINels Opdrachtgever en officieel contactpersoon. ● Erwin Buckers en Vincent Koeman, CHAINels Development team van CHAINels en mede-oprichters en degene waarmee wij het meeste contact zullen hebben. ● ???, TU Delft Projectbegeleider
4.2 Personeel Er zal 10 weken lang fulltime aan het project gewerkt worden. Voornamelijk zullen we gezamenlijk werken en afspreken.
4.3 Rapportering Met Erwin Buckers en Vincent Koeman zullen wij vaak contact mee hebben en bijna dagelijks zien. Daarnaast zal elke week worden afgesproken met de rest van het CHAINels team en de andere groep die hun Bachelorproject bij CHAINels doet. Daar zal worden
getoond wat is gedaan in de voorgaande week en wat zal gebeuren de komende week.
4.4 Resources We zullen vooral werken op de TU Delft. Er wordt geprobeerd door CHAINels om relatief vaste werkplekken/zalen te regelen voor beide projectgroepen. Verder zal er toegang worden geregeld tot de SVN van CHAINels en hun testomgeving.
5. Kwaliteitsborging Door de wekelijkse vergaderingen met het hele team, dus ook het andere projectgroep en de opdrachtgevers, zal de kwaliteit gewaarborgd zijn. Dit komt omdat elke week besproken wordt wat er gedaan moet worden en wat er gedaan is. Ook door makkelijk contact zullen problemen snel besproken en opgelost kunnen worden. De code zal goed gedocumenteerd moeten worden en een goede opmaak hebben, zodat het duidelijk is voor de opdrachtgevers en natuurlijk ook voor ons. De kwaliteit zal ook gewaarborgd worden doordat we de geimplementeerde code opsturen naar SIG voor controle, waarna we de gegeven feedback verwerken. Er is ook een geheimhoudingsverklaring getekend zodat alles vertrouwelijk blijft.
D Orientation Report
D
Orientation Report
47
IN3405 Bachelorproject - Oriëntatieverslag CHAINels - Netwerkvisualisatie
Hans van den Bogert (1307983) André Vollering (1516485)
Christiaan Titos Bolivar (4009010)
2
Inhoudsopgave 1 2 3 3.1 3.2 4 4.1 4.2 4.3 4.4 4.5 5 6
Inleiding . . . . . . . . . . . . . . . . . . CHAINels . . . . . . . . . . . . . . . . . Bedrijfsnetwerken . . . . . . . . . . . . . Relevante visualisaties . . . . . . . . . . Een Goede Recommandatie . . . . . . . Technische Oriëntatie . . . . . . . . . . Soortgelijke Applicaties . . . . . . . . . Mogelijke Programmeertalen . . . . . . Bestaande Frameworks en Libraries . . . Recommandatie Algoritmen . . . . . . . Ontwikkelomgeving en Methode . . . . . Update to Recommendation Algorithm . Conclusie . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
3 4 5 5 5 7 7 9 10 10 11 12 15
1 Inleiding
1
3
Inleiding
De eerste fase van ons project is de oriëntatie fase. In deze fase oriënteren we ons op onze opdracht, de projectomgeving (CHAINels), bekijken we bestaande graaf visualisatie toepassingen en onderzoeken we mogelijke bestaande libraries en frameworks die we kunnen gebruiken. We hebben twee weken voor deze fase gebruikt, waarvan de eerste week ook gebruikt is voor een plan van aanpak te schrijven. In die twee weken hebben we meerdere malen met het CHAINels team afgesproken. Een van die meetings was een brainstorm sessie met Erwin Buckers, Vincent Koeman en Willem Buijs (de designer) over wat zij voor ogen hadden en wat wij voor ideeën hadden. Hier zijn een paar nuttige ideeën uit voort gekomen, zowel voor ons als voor het CHAINels team. In een andere meeting hebben we uitleg gekregen over de broncode van CHAINels en waar en hoe ons project komt te draaien. In dit verslag zullen in Hoofdstuk 2 de relevante onderdelen van het ontwerp en broncode van CHAINels worden uitgelegd. Vervolgens zal in Hoofdstuk 3 worden besproken wat voor een bedrijf relevante visualisaties en goede chain aanbevelingen zijn. Hoofdstuk 4 is vooral technisch van aard, waarin we ingaan op een aantal mogelijke programmeertalen, frameworks, libraries en recomendation algoritmes. In het laatste hoofstuk, hoofdstuk 5, zullen we conclusies trekken aan de hand van onze bevindingen.
2 CHAINels
2
4
CHAINels
CHAINels is geprogrammeerd in object-georiënteerd PHP. De belangrijkste objecten in het systeem zijn Account en Company. Een Company kan meerdere Accounts hebben. Een Company heeft in de database een lijst met Company ID’s (CID’s). Dat zijn dus de chains. Dit is na overleg aangepast. Een Chain is nu een aparte class die waardes heeft als type en een datum van creatie. De database zelf is ook speciaal. Het is een "NoSQL" database genaamd Redis. Er is dus geen traditioneel relationeel schema. Redis werkt met key-value pairs en heeft ook de mogelijkheid om andere datastructuren te gebruiken. Heel veel zullen wij niet met de database in direct contact komen, omdat er al vele functies zijn geschreven door het CHAINels team om data op te vragen en op te slaan. Een andere relevante functionaliteit in CHAINels is het feit dat pagina’s niet op de traditionele manier worden geladen. Normaal gesproken stuurt de server de gehele code in een keer door naar de client. CHAINels maakt echter gebruikt van "pagelets", kleine stukken code die op prioriteit worden verstuurd. Dit verbetert de laadprestaties van de website. Deze techniek heet "Bigpipe", ontwikkeld door Facebook [3]. Onze applicatie zal waarschijnlijk ook in zo’n pagelet geladen worden.
3 Bedrijfsnetwerken
3
5
Bedrijfsnetwerken
In dit hoofdstuk bespreken we de bedrijfstechnische kant van de opdracht. Wat zou relevant zijn voor een bedrijf om te zien in een visualisatie en wat is een goede recommandatie. Dit hoofdstuk is tot stand gekomen door als groepje er over na te denken en een aantal brainstorm sessies met de rest van het CHAINels team.
3.1
Relevante visualisaties
Het representeren van de data kan op meerdere manieren gedaan worden en er kan onderscheid gemaakt worden tussen 2 soorten categorieën: intern en extern. 3.1.1
Intern
Onder intern wordt verstaan dat deze representatie alleen zichtbaar is voor het bedrijf zelf. Dit hebben we gedaan omdat het waarschijnlijk is dat een bedrijf niet alle gegevens wil laten zien aan andere bedrijven, waarbij het niet uitmaakt of ze bevriend zijn of niet. Wat we hier willen laten zien zijn bijvoorbeeld de bedrijfsketens, de recommandaties van bedrijven, een geografische straal/bereik van een bedrijf en statistieken. Dit kunnen er nog meer worden als we op nieuwe ideeën komen of als ergens vraag naar is. Dit betreft ook het wegvallen van functionaliteiten doordat ze niet nuttig zijn. De visualisatie van bedrijfsketens kan gezien worden als een graaf in de vorm van een boom, waarbij de root node het bedrijf zelf is, de chains met bevriende bedrijven gerepresenteerd worden door kanten (edges) en de bedrijven die hierbij horen als opvolgende knopen (nodes). Hierbij kan bijvoorbeeld gekeken worden wat de relatie is van jouw bedrijf met een ander bedrijf. Ook is er een representatie van de recommandaties van bedrijven. Dit zal alle gerecommandeerde bedrijven weergeven op een overzichtelijke manier. Er zal een functionaliteit zijn om het bedrijf gelijk toe te voegen. De geografische straal/bereik van een bedrijf kan worden gevisualiseerd als een kaart waarop de bedrijven worden afgebeeld als knopen (nodes), en hier omheen zal een cirkel worden weer gegeven wat de straal/bereik van een bedrijf is. Dit kan handig zijn om te zien welke bedrijven bereikt kunnen worden en waar de andere bedrijven gelegen zijn. Als laatst zijn er de statistieken. Het is nog een beetje onduidelijk wat we hier allemaal in gaan beschrijven. Het idee achter dit punt is dat hier alle interessante en handige informatie wordt weergegeven, bijvoorbeeld een tijdlijn om te zien hoe het netwerk groeit. Ook kan gedacht worden aan een vraag/aanbod grafiek. 3.1.2
Extern
De visualisaties die onder extern vallen zullen zichtbaar zijn voor iedereen. Bijvoorbeeld overlappende grafen van bedrijven, als bedrijf x bij bedrijf y op zijn profiel kijkt dan kunnen deze twee grafen overlapt worden, waardoor de verschillen en overeenkomsten wat makkelijker gezien kunnen worden. Een extra functionaliteit is dat er ingezoomd kan worden op continenten, landen, staten, provincies, steden, etc. De externe visualisatie zal beperkt zijn ten opzichte van de interne visualisaties omdat de interne waarschijnlijk meer gebruikt zal worden. De externe visualisatie zal meer een leuke toevoeging zijn aan een profielpagina, die er nu wat statisch uit ziet. Het moet uitnodigen om er mee te "spelen".
3.2
Een Goede Recommandatie
Om een goede recommandatie te maken moeten we kijken naar welke dingen een bedrijf belangrijk vindt en wat een bedrijf in een ander bedrijf zoekt. We kwamen al snel op de vraag of we moeten gaan kijken naar bedrijven die veel op elkaar lijken of moeten we dit juist helemaal niet doen, want er ontstaat een probleem als we dit gaan doen. Dit probleem is dat bedrijven die veel op elkaar lijken vaak concurrenten van elkaar zijn, dus zullen ze elkaar minder snel toevoegen als “vriend”.
3 Bedrijfsnetwerken
6
Als ze elkaar zouden toevoegen als vriend dan kan dit leiden tot twee belangrijke dingen, of ze zijn gewoon vrienden en zullen gaan samenwerken, wat goed voor beide bedrijven is. Het kan ook zo zijn dat ze elkaars klanten (bevriende bedrijven waarmee ze samenwerken) overnemen omdat ze nu daartoe makkelijker toegang krijgen, wat natuurlijk goed kan zijn voor de concurrentie en marktwerking. Ondanks dit zal het waarschijnlijk niet iets zijn wat de bedrijven willen, aangezien ze hier de dupe van kunnen zijn. In dit laatste geval kan het ook zo worden dat de chains dan allemaal privé worden (voor het houden van hun clientèle) en dit is iets wat niet ter bevordering werkt van het systeem. Ondanks deze belangrijke vraag die we zullen proberen te beantwoorden door middel van een enquête, zijn we op de volgende belangrijke punten gekomen om bedrijven te ’chainen’. Het eerste onderwerp is de industrie waar ze in zitten. Dit is een belangrijk kenmerk van een bedrijf en zal ook zijn interesses weergeven. Sommige industrieën kunnen beter met elkaar werken dan andere door hun functies en sommige industrieën zullen helemaal niks met elkaar hebben en kunnen hierdoor niet gerecommandeerd worden. Een ander punt is dat we zullen gaan kijken naar de geografie, het gebied waar ze liggen en wat hun bereik is. Ook dit is een logisch punt omdat bedrijven die bij elkaar in de buurt liggen eerder met elkaar zullen samenwerken dan bedrijven die ver van elkaar liggen. Door te kijken naar vraag en aanbod zal het ook mogelijk zijn om bedrijven aan elkaar voor te stellen. Als het ene bedrijf iets aanbiedt en een ander bedrijf dit nodig heeft dan kan dit een begin zijn van een samenwerking. Als laatste punt komt de “soort”-chain. De chains zullen gelabeld worden met een bepaalde tag. Deze tag houd in wat voor soort relatie de bedrijven hebben. Door naar de tags te kijken van chains van vrienden kunnen alleen bedrijven gerecommandeerd worden die de tag vrienden hebben en niet de tag business partner of iets dergelijk. Als we naar al deze punten kijken dan zullen we een goede recommandatie maken.
4 Technische Oriëntatie
4
7
Technische Oriëntatie
In dit hoofdstuk staat de technische kant van de opdracht centraal. We bekijken een aantal soortgelijke applicaties en onderzoeken bestaande frameworks en libraries.
4.1
Soortgelijke Applicaties
We hebben drie verschillende applicaties gevonden die een soortgelijke functie/werking hebben. Deze zullen hieronder besproken worden. Van deze applicaties kunnen we de werking zien en gebruiken in het ontwerp van onze applicatie. De gevonden applicaties zijn: • Google Analytics
• LinkedIn InMaps • Linkfluence
4.1.1
Google Analytics
Deze applicatie van Google is niet speciaal op netwerken gericht, maar is zoals ze zelf zeggen: “de zakelijke webanalyse-oplossing die u het juiste inzicht verschaft in uw websiteverkeer en marketingeffectiviteit.” Dit betekent dat ze inzicht proberen te geven in het netwerkverkeer en in marketing door middel van cijfers en grafieken (zie Figuur 1) [8]. Dit is wat CHAINels ook wil doen met de Netwerkvisualitie applicatie, maar dan niet het marketing gedeelte, dus alleen het inzicht geven van het netwerk op een inzichtelijke manier. Google Analytics is redelijk overzichtelijk en heeft veel verschillende handige functies. Alleen een nadeel is dat, doordat het zoveel functies heeft, het niet voor iedereen makkelijk zal zijn om ermee te beginnen en om het goed te kunnen gebruiken, wat wel de bedoeling is.
Fig. 1: Voorbeeld van Google Analytics. 4.1.2
LinkedIn InMaps
InMaps is een goed voorbeeld van wat CHAINels als eerste idee had hoe de Netwerkvisualisatie moest worden. Het is een speciale applicatie van de netwerk site LinkedIn waarbij door middel van een graaf het netwerk wordt gevisualiseerd [11]. De knopen (nodes) moeten personen of bedrijven voorstellen en de kanten (edges) moeten relaties voorstellen, deze relaties kunnen gezien worden als een geaccepteerd vriendschapsverzoek (Figuur 2). Het ziet er op het eerste gezicht goed uit
4 Technische Oriëntatie
8
door het gebruik van kleuren, maar de enorme hoeveelheid aan edges maakt het onoverzichtelijk en verder is er ook niet veel informatie uit de graaf te halen. Deze applicatie is gebaseerd op Gephi, een programma dat gebruikt kan worden om grafen te visualiseren en zal besproken worden in Paragraaf 4.3. Om de gemaakte map te visualiseren op een website, is gebruik gemaakt van Seadragon en RaphaelJS. Doordat er hiervan gebruik gemaakt is, is InMaps alleen compatible met de nieuwere browsers. Onder de nieuwere browsers wordt Chrome, Safari, Firefox en IE7+ verstaan[10].
Fig. 2: Voorbeeld van een gegenereerde map met InMaps van LinkedIn. 4.1.3
Linkfluence
Linkfluence is een bedrijf dat zich specialiseert op het analyseren van data in social network sites en geeft feedback aan de hand van verzamelde informatie [12]. Ze leveren bijvoorbeeld services van het mappen van de verzamelde data op een graaf. Ook Linkfluence gebruikt Gephi voor het initialiseren van de maps. De feedback wordt geleverd in de vorm van een krant waarbij de grafen als plaatjes worden weergegeven en dus niet interactief zijn. Dit heeft als voordeel dat het zo goed als overal geopend kan worden. 4.1.4
Vizster
Vizster [16] is een interactive tool voor online sociale netwerken. Het is gemaakt om data te verkennen door middel van een sociograaf (een graaf waarbij de structuur van relaties in een groepssituatie weergegeven wordt) [19]. De grafen die gemaakt worden met deze tool zijn overzichtelijk en werken intuïtief. In tegenstelling tot de andere soortgelijke applicaties hebben ze een handige en uitgebreide optie om te filteren op bepaalde features. Ze hebben ook verschillende opties om de graaf te bekijken en te visualiseren.
4 Technische Oriëntatie
9
Fig. 3: Voorbeeld van een feature van Vizster.
4.1.5
4.2 4.2.1
Overige
Mogelijke Programmeertalen PHP
PHP is een server-side scripting-taal die zoals in Hoofdstuk 2 is beschreven is gebruikt om CHAINels mee te programmeren. De eerste gedachte is dus om ook PHP te gebruiken voor ons project. Bij de netwerkvisualisatie is een key-feature de interactiviteit. Dat maakt het meteen lastig om een server-side taal te gebruiken, aangezien alles al is gecreëerd bij het laden van de pagina en er dus een statische afbeelding overblijft. Je zult een client-side taal moeten gebruiken om interactiviteit te bewerkstelligen. Voor het opvragen van de benodigde data kan PHP wel worden gebruikt. Het chain recomendation algoritme kan wel in PHP worden geprogrammeerd. Aangezien CHAINels in object-georienteerd PHP is geschreven, ligt het voor de hand om in dezelfde taal te programmeren. 4.2.2
Adobe Flash
Adobe Flash is een multimedia platform om video, games en applicaties weer te geven op websites. Via ActionScript wordt er geprogrammeerd voor Flash. Flash is zeer geschikt voor interactieve applicaties op het internet. Echter is Flash steeds minder populair aan het worden [21]. Er verschijnen steeds weer nieuwe beveiligingslekken en veel software- en hardwaremakers stoppen met het supporten van Flash. Vooral mobiele operating systems (iOS, Android) stappen steeds sneller over en het mobiele besturingssysteem van Apple, iOS, heeft in zijn geheel geen ondersteuning voor Flash. Ook Youtube is hun Flash-video speler aan het vervangen door een HTML5 speler. Aangezien CHAINels zich op de toekomst richt, vinden we dat Flash geen goede keuze is om de netwerkvisualisatie in te programmeren. 4.2.3
JavaScript
JavaScript is een general-purpose scripting taal die voor allerlei doeleinden (vooral op websites) wordt gebruikt. jQuery is een zeer populaire JavaScript library die o.a. event-handeling, animaties en Ajax interacties makkelijker en duidelijker maakt. Het is dus volledig client-side en zeer geschikt voor interactieve applicaties, alhoewel er wel eerst een element op de pagina moet zijn dat je kunt bewerken. Daar is het volgende kopje geschikt voor.
4 Technische Oriëntatie
4.2.4
10
HTML5
HTML5 is de nieuwste versie van de HTML-standaard. In deze nieuwe standaard zijn vooral van belang het nieuwe