_             _                                  _     ____
    /     ___ ___(_) __ _ _ __ _ __ ___    ___ _ __ | |_ | ___|
  / _  / __/ __| |/ _` | '_ | '_ ` _  / _  '_ | __| |___ 
 / ___ __ __  | (_| | | | | | | | | | __/ | | | |_      ___) |
/_/     ____/___/_|__, |_| |_|_| |_| |_|___|_| |_|__| |____/
                     |___/
CS 2110 CODING COMPETITION 2009 ENTRY           by Mengxiang and Chuck

-====================================================================-

Table of Contents
-----------------
1.) Philosophy
2.) Caching
3.) GraphViz
4.) Root-Finding Algorithm
5.) Multithreading (!)
6.) Fibonacci Heap
7.) Prim's Algorithm
8.) Testing
9.) Conclusion and Future Work

-====================================================================-
 - PHILOSOPHY -
-====================================================================-

Our philosophy behind this project was to emphasize performance while
still guaranteeing accurate results. We took several approaches toward
achieving this goal. We owe our vast speed-up to a few clever tricks,
which we discuss thoroughly in this "Read Me" file. The need for speed
and our own unrelenting competitive spirit were our motivations for
developing this project.

We hope you enjoy reviewing our entry as much as we did creating it!

-====================================================================-
 - CACHING -
-====================================================================-

The first and perhaps most obvious approach we took toward making the
program run faster was to cache the gene and animal distances. Whenever
a distance is requested, we first check whether it has already been
computed. If it has, we look it up in a hash table in expected O(1)
time. If it has not, we compute it from scratch and store it in the
hash table for future re-use. The gene and animal pairs serve as the
keys to the hash table. Java's built-in hash table functionality
sufficed for this task. We realize that this speed comes at the cost of
memory, but the performance gains made the trade-off well worth it.
Before caching was added, the program took about 1.5 hours to generate
40 graphs. Afterward, it took about ten seconds. Reducing each repeated
distance computation to an O(1) lookup really does pay off.
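
The look-up-then-compute pattern described above can be sketched as
follows. This is a minimal illustration, not our actual code: the
DistanceCache class, the '|'-joined string key, and the mismatch-count
stand-in for the real distance function are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Memoizes a symmetric pairwise distance. All names here are illustrative. */
final class DistanceCache {
    private final Map<String, Integer> cache = new HashMap<>();
    private int computations = 0;

    /** Order-independent key, so (a, b) and (b, a) share one entry.
     *  Assumes the inputs never contain the '|' separator. */
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    /** Stand-in for the expensive computation: counts mismatched positions. */
    private int computeDistance(String a, String b) {
        computations++;
        int shared = Math.min(a.length(), b.length());
        int d = Math.abs(a.length() - b.length());
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    /** Returns the cached distance, computing and storing it on a miss. */
    int distance(String a, String b) {
        String k = key(a, b);
        Integer cached = cache.get(k);
        if (cached != null) return cached;
        int d = computeDistance(a, b);
        cache.put(k, d);
        return d;
    }

    /** How many times the expensive computation actually ran. */
    int computations() { return computations; }

    public static void main(String[] args) {
        DistanceCache c = new DistanceCache();
        System.out.println(c.distance("ACGT", "ACGA"));
        System.out.println(c.distance("ACGA", "ACGT")); // served from the cache
        System.out.println(c.computations());
    }
}
```

The second call in main asks for the same pair in the opposite order,
so it is answered from the hash table without recomputing anything.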

-====================================================================-
 - GRAPHVIZ -
-====================================================================-

We used the GraphViz software as a springboard toward implementing our
root-finding algorithm. We wrote a GraphViz class that generates a
GraphViz output file, similar to the Dendroscope and TreePrinter
classes. We then generated phylogenomic graphs with each of the 40
animals as the root, and analyzed the characteristics of these graphs in
search of a metric for determining the best root animal. Example JPEG
graphs and their corresponding GraphViz source code are provided in
this .ZIP file.
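
For readers unfamiliar with the GraphViz input format, here is a rough
sketch of how a tree can be written out as DOT source. It is not our
GraphViz class: the DotSketch name, the parent-to-children map, and the
node names in main are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: renders a parent -> children map as DOT text. */
final class DotSketch {
    static String toDot(String graphName, Map<String, List<String>> children) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(graphName).append(" {\n");
        for (Map.Entry<String, List<String>> entry : children.entrySet()) {
            for (String child : entry.getValue()) {
                // One quoted edge statement per parent/child pair.
                sb.append("    \"").append(entry.getKey())
                  .append("\" -> \"").append(child).append("\";\n");
            }
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Made-up node names; these are not edges from a real result tree.
        Map<String, List<String>> tree = new LinkedHashMap<>();
        tree.put("Root", Arrays.asList("ChildA", "ChildB"));
        tree.put("ChildA", Arrays.asList("Leaf"));
        System.out.print(toDot("Phylogeny", tree));
    }
}
```

The resulting text can be saved to a file and rendered with the
GraphViz "dot" tool, for example "dot -Tjpg tree.dot -o tree.jpg".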

-====================================================================-
 - Root-Finding Algorithm -
-====================================================================-

We immediately noticed the aesthetics of the Parmesanian graph, which,
according to the online assignment, had the best root. The tree was much
wider than tall, so initially we figured we could simply use the width
of the tree to determine the root. Unfortunately, several trees had the
same width as the Parmesanian tree, resulting in ties for the best
root. Moreover, the algorithm was not deterministic because the ties
were not resolved in any defined manner, so any tree with the same width
could be selected as the best root tree.

Our next attempt was to consider the width-to-height ratio of the graphs.
The tree with the greatest width-to-height ratio appeared to be the one
with the Parmesanian root. However, this technique suffered the same
fate as the previous one: several trees tied with the same ratio, so
this algorithm was not deterministic either.

We had one last characteristic to consider, though. The tree was much
more "balanced" for the Parmesanian than for any other animal. We sought
a mathematical definition of balance: a definitive way to quantitatively
measure the balance of a given tree. We scoured the Internet relentlessly
for such a measure to no avail, so we began to brainstorm on our own.

The first method up for consideration was to utilize the fact that the
number of nodes in a full binary tree doubles at each level. Thus, in a
full n-ary tree it must grow by a factor of n at each level. The closer
the tree is to being balanced, the more closely this relationship holds.
The main problem with this approach, though, is: what is n? Should n be
the same for all trees? What if one n is better for one graph and another
n is better for another graph? This method left us with even more
unanswered questions, so we looked for an alternative.

Finally, the method we chose to implement was a recursive algorithm.
We realized that balanced trees have equal numbers of children on each side.
To determine just how balanced a tree is, we recursively compute a so-called
"mirror index" that takes this into consideration. The mirror index
algorithm visits each sub-tree of a given node, counts its children, and
adds the differences between this sub-tree's count and those of the other
sub-trees at that level to the mirror index. Then the algorithm recurses to
the next level and counts the "sub-sub-trees", adding the differences in
children to the index accordingly. A lower index means a more balanced
tree. The algorithm worked! As you can see below, the mirror index is by
far the lowest for the Parmesanian animal:

Frilly_Sea_Sprat: 70
Asian_Boxing_Lobster: 292
Policle: 330
Jelly_Belly: 198
Ballards_Hooting_Crane: 262
Pompous_Snark: 262
Fuzzy_Trible: 174
Sextopus: 356
Gilligans_Squimp: 292
Ballards_ProtoDuck: 222
Shy_Frecklepuss: 216
Bards_Star: 292
Larval_TreeNymph: 192
Globe_Floater: 216
Snuffling_Blat: 152
Big-Billed_Peacock: 262
Spotted_Ghila: 216
Gray_Floop: 216
Leaping_Lizard: 152
Sprats_Butterfly: 192
Munkles_Mouse: 330
Strats_Squirrel: 262
Biscuit: 262
Nocturnal_Mourningbird: 152
Green_Herring: 286
Nocturnal_Plexum: 262
Green_SnapDragon: 222
Common_Mudfly: 262
Striped_Salamander: 286
Paradise_Rockfish: 216
Hairy_Rock_Snot: 58
Darwins_Tortle: 292
Hallucigenia: 292
Parmesanian: 24
Swamp_Slime: 140
Pink_Ziffer: 286
Toothy_Ballonfish: 216
Elephant_Snark: 152
Translucent_Tridle: 88
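
One possible reading of the mirror index computation is sketched below.
Treat it as an interpretation of the prose rather than the code that
produced the numbers above: the Node class is hypothetical, "counting
children" is taken to mean counting every node in a sub-tree, and each
sub-tree is compared against every other sub-tree at its level, so each
pair contributes twice.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative mirror-index sketch; names and details are assumptions. */
final class MirrorIndex {
    static final class Node {
        final List<Node> children = new ArrayList<>();
        Node add(Node child) { children.add(child); return this; }
    }

    /** Number of nodes in the sub-tree rooted at n, including n itself. */
    static int size(Node n) {
        int total = 1;
        for (Node c : n.children) total += size(c);
        return total;
    }

    /** Sum of pairwise sub-tree size differences, accumulated recursively. */
    static int mirrorIndex(Node n) {
        int k = n.children.size();
        int[] sizes = new int[k];
        for (int i = 0; i < k; i++) sizes[i] = size(n.children.get(i));

        int index = 0;
        // Compare each sub-tree with every other sub-tree at this level.
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                if (i != j) index += Math.abs(sizes[i] - sizes[j]);
            }
        }
        // Recurse so that imbalance deeper in the tree also counts.
        for (Node c : n.children) index += mirrorIndex(c);
        return index;
    }

    public static void main(String[] args) {
        Node balanced = new Node().add(new Node()).add(new Node());
        Node inner = new Node().add(new Node()).add(new Node());
        Node lopsided = new Node().add(inner).add(new Node());
        System.out.println(mirrorIndex(balanced));
        System.out.println(mirrorIndex(lopsided));
    }
}
```

This version recomputes sub-tree sizes at every level for clarity; a
single bottom-up pass would avoid the repeated work.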

-====================================================================-
 - Multithreading (!) -
-====================================================================-

We decided to go out on a limb here and do multithreading. After all, now that
Moore's Law is quieting down and we are approaching the physical limits of what
good-old silicon transistor CPUs can do as far as clock speed goes, the chip
manufacturers still want to release innovative products, so their idea is,
"Throw more cores on it!" Unfortunately, computer scientists haven't figured out
how to take full advantage of additional cores yet, and parallelization is
still an active research topic. Even PC game technology such as Crysis and the
Source Engine has only recently added support for multithreading. Now, GPUs are
being utilized computationally for the same purpose: parallelization.

We were admittedly tired of seeing the CPU usage in the Windows task manager
only going up to 50% on a dual-core ThinkPad laptop, especially back when
generating graphs took a really long time before optimizations were put in
place. We desperately wanted to double the speed of the program by using near
100% CPU usage the entire time, and we were inspired by Professor Birman's
lecture on multithreading.

We decided to jump on the bandwagon here and implement the gene algorithms in
parallelizable form. To do this, we divided our program into three stages that
must run in series: the distance computations, the animal species graph
generation, and the root-finding algorithm. We then wrote multithreaded
implementations of these algorithms. Luckily, the computations were extremely
well-suited for parallelization; the algorithms could work on different animals
at the same time since each animal's data is independent of the others'.

We wrote a ThreadManager class (not sure if this fits some kind of design
pattern) that dispatches worker threads with allocated workloads that work in
tandem to accomplish each of the three serial stages in parallel. We ran into
the problem of deciding when each thread is done. Our solution was to put the
main thread to sleep and have it "wake up" every 100 ms to poll the other
threads to see whether they had completed. We realize that Java has built-in
wait/notify functionality for threads, but alas we ran out of time, with only
three hours to go before the deadline. We needed a way to make sure that all of
the threads were completely finished before moving on to the next serial stage,
so we implemented our own Semaphore class with atomic operations for decreasing
and increasing the semaphore count (P and V).
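
The dispatch-then-poll scheme can be sketched roughly as follows. This
is an illustration under assumptions, not our ThreadManager: the class
names are made up, the workload is a trivial stand-in, and we guess
that the manager calls V when it launches a worker and the worker calls
P when it finishes. Unlike a true semaphore, P here never blocks.

```java
/** Illustrative sketch of waiting for workers by polling a counter. */
final class PollingSketch {
    /** Home-made counter with atomic increment/decrement via synchronized. */
    static final class CountingSemaphore {
        private int count;
        synchronized void v() { count++; }
        synchronized void p() { count--; }
        synchronized int count() { return count; }
    }

    /** Runs one stage on `workers` threads and returns a combined result. */
    static int runStage(int workers) {
        CountingSemaphore running = new CountingSemaphore();
        int[] results = new int[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            running.v();                       // one more worker in flight
            new Thread(() -> {
                try {
                    results[id] = id * id;     // stand-in for real work
                } finally {
                    running.p();               // this worker is finished
                }
            }).start();
        }
        try {
            // The main thread sleeps and re-checks every 100 ms.
            while (running.count() > 0) Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        int sum = 0;
        for (int r : results) sum += r;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runStage(4));
    }
}
```

Calling Thread.join on each worker, or using
java.util.concurrent.CountDownLatch, would remove the need for polling
altogether.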

Multithreaded code can be hard to debug. We discovered this ourselves the hard
way with this assignment. The programmer's mantra of "code for an hour, debug
for a week" rang quite true for us. We first ran into problems with Java's
built-in HashMap class not being thread-safe. Luckily for us, Java comes
equipped with a ConcurrentHashMap class that alleviates these issues. Switching
to the concurrent step-child of HashMap was not difficult at all. We also ran
into deadlocks and even more thread safety issues. HashSet just wasn't
cooperating with us and caused the threads to deadlock halfway through. We took
advantage of Java's "synchronized" keyword to make the Phylogeny tree generation
code run atomically in each thread, and this resolved our deadlocking issues.
At times, the frustration became such that we almost gave up on the idea of
using threads, but we finally managed to work out all the bugs and come up with
a parallel implementation of the project.
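
As a small illustration of the HashMap-to-ConcurrentHashMap switch,
the sketch below lets several threads fill one shared cache. The
SharedCacheSketch class and its squared-number workload are
hypothetical, and computeIfAbsent is a newer convenience than what we
had available at the time.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative only: many threads populating one thread-safe cache. */
final class SharedCacheSketch {
    static int fillConcurrently(int threads, int keys) {
        ConcurrentMap<Integer, Integer> cache = new ConcurrentHashMap<>();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int k = 0; k < keys; k++) {
                    // Atomic check-then-insert; safe under contention.
                    cache.computeIfAbsent(k, x -> x * x);
                }
            });
            pool[t].start();
        }
        for (Thread th : pool) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(8, 1000));
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, the same code
has no defined behavior under concurrent writes.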

-====================================================================-
 - Fibonacci Heap -
-====================================================================-

The online assignment web page suggested that if we were truly crazy, we could
use our own Fibonacci Heap implementation to generate the MST. Since we are, in
fact, self-professedly crazy, we thought, "Sure! Why not?" This task did not
prove to be nearly as easy as we thought it would be. The Wikipedia article was
vague in explaining the Fibonacci Heap operations, so we had to sort of reverse
engineer the diagrams on there. Moreover, we ran into problems with having
marked root nodes, and our alpha version of the implementation frequently
violated the heap invariant. JUnit testing came to the rescue here, and we were
able to work out all of the bugs and reap the benefits of using a Fibonacci
Heap. We wrote our own Priority Queue implementation that took advantage of
this Fibonacci Heap class and used it for our next task: Prim's algorithm.

-====================================================================-
 - Prim's Algorithm -
-====================================================================-

The project web page mentioned using Prim's algorithm instead of Assignment 4's
Naive MST algorithm. We figured, what better way to put our Fibonacci heap
implementation to use?

Prim's algorithm turned out to be a bit more of a challenge than we had thought.
We ended up having to rebuild the PriorityQueue after each iteration to reflect
the new distances of the animals that were going to be added. Moreover, we had
to scrap our code three times and rewrite it because things just were not
working properly.

Implementing the lexicographic tie breaker turned out to be the hardest part.
Initially, our Fibonacci Priority Queue class was designed to be more like a
traditional Priority Queue, using numerical priorities instead of comparators,
which struck us as a clumsy solution. Nevertheless, we had to resort to
using generics and supporting the comparator interface in our code, albeit at
the cost of more bloated, more complex code. Once we switched to the
comparator interface, we merely had to write our own comparison routine to
check for and break ties.
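
To make the comparator idea concrete, here is a hedged sketch of Prim's
algorithm with a lexicographic tie breaker. It swaps in Java's own
java.util.PriorityQueue for our Fibonacci Priority Queue and skips
stale queue entries instead of rebuilding the queue, so it shows the
technique rather than our implementation. The Edge class, the name
ordering, and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative Prim's algorithm with ties broken by node name. */
final class PrimSketch {
    static final class Edge {
        final int from, to, weight;
        Edge(int from, int to, int weight) {
            this.from = from; this.to = to; this.weight = weight;
        }
    }

    /** Returns MST edges as "from-to" strings, grown from node 0. */
    static List<String> mst(String[] names, int[][] dist) {
        int n = names.length;
        // Lowest weight first; equal weights fall back to name order.
        Comparator<Edge> order = Comparator
                .comparingInt((Edge e) -> e.weight)
                .thenComparing(e -> names[e.to])
                .thenComparing(e -> names[e.from]);
        PriorityQueue<Edge> queue = new PriorityQueue<>(order);
        boolean[] inTree = new boolean[n];
        List<String> edges = new ArrayList<>();

        inTree[0] = true;
        for (int v = 1; v < n; v++) queue.add(new Edge(0, v, dist[0][v]));

        while (edges.size() < n - 1 && !queue.isEmpty()) {
            Edge e = queue.poll();
            if (inTree[e.to]) continue;        // stale entry, already added
            inTree[e.to] = true;
            edges.add(names[e.from] + "-" + names[e.to]);
            for (int v = 0; v < n; v++) {
                if (!inTree[v]) queue.add(new Edge(e.to, v, dist[e.to][v]));
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Made-up data: Cat and Bat are equally close to Root.
        String[] names = {"Root", "Cat", "Bat"};
        int[][] dist = {{0, 1, 1}, {1, 0, 5}, {1, 5, 0}};
        System.out.println(mst(names, dist));
    }
}
```

In main, the two candidate edges tie on weight, so the comparator
decides the order in which they join the tree.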

We thought of two ways of getting around Professor Birman's infinite-distance
"hack" for siblings. The first was to use some sort of look-up table that
could tell us instantly whether two animals were siblings, and then do some
sort of clever work-around for siblings in the PhylogenyTree implementation of
Prim's algorithm. The second idea was the one we ended up using. When we build
the Phylogeny tree, we first ignore all siblings when looking for the closest
distance for the minimum spanning tree. If we cannot find a node because all of
the candidates are siblings, then we take the first sibling and use that
instead as the animal with the closest distance. The end result is that we no
longer need to set the siblings' distances to the ad hoc infinity value that
was needed before, and we still get exactly the same MST.
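
The selection rule in the second idea might look something like the
sketch below. The array-based signature and the SiblingFallback name
are purely hypothetical; the point is only the two-tier choice of the
closest non-sibling first and the first sibling otherwise.

```java
/** Illustrative only: prefer non-siblings, fall back to the first sibling. */
final class SiblingFallback {
    /**
     * Returns the index of the closest non-sibling candidate. If every
     * candidate is a sibling, returns the index of the first sibling.
     * Returns -1 when there are no candidates at all.
     */
    static int pick(int[] distances, boolean[] isSibling) {
        int best = -1;
        int firstSibling = -1;
        for (int i = 0; i < distances.length; i++) {
            if (isSibling[i]) {
                if (firstSibling < 0) firstSibling = i;
            } else if (best < 0 || distances[i] < distances[best]) {
                best = i;
            }
        }
        return best >= 0 ? best : firstSibling;
    }

    public static void main(String[] args) {
        // The sibling at index 1 is closer, but a non-sibling still wins.
        System.out.println(pick(new int[] {5, 2, 9},
                                new boolean[] {false, true, false}));
        // Only siblings are available, so the first one is used.
        System.out.println(pick(new int[] {5, 2},
                                new boolean[] {true, true}));
    }
}
```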

The resulting Prim's algorithm implementation turned out to be a lot more
streamlined than we had expected. It was clearly cleaner than the Naive
MST-building algorithm and about half the length in code.

-====================================================================-
 - TESTING -
-====================================================================-

Our attitude toward testing was "test early and test often". Thus we devised
as many tests as we could to try to break our program. We were successful in
many instances, which helped to improve the stability of our code. Throughout
the project, we used the Subversion version control system that Chuck had
installed on his OpenBSD box at home to aid in collaboration. Eclipse even
had a plugin that let us use SVN from within the IDE. This allowed us to
write tests and run them simultaneously in hopes of discovering bugs. We came
up with some pretty cool ideas for tests!

Multithreading necessitated a unique kind of testing we called "stress
testing". The idea was to throw 20 threads in the ring and have them duke it
out and try to deadlock each other or reveal any race conditions. For the
latter, we repeated the test for multiple trials and checked to make sure the
root animal was the same each time. This spot checking turned out to be very
useful for detecting small variations in the tree. Such variations would
manifest themselves later on in the root-finding algorithm, yielding
completely different results.
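
A repeat-and-compare stress test of this kind could be sketched as
follows. The StressSketch class is hypothetical and a shared counter
stands in for the real tree-building workload; the harness simply
re-runs a multithreaded computation and checks that every trial gives
the same answer.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative stress-test harness; the workload is a stand-in. */
final class StressSketch {
    /** True if the computation yields the same result on every trial. */
    static <T> boolean sameEveryTrial(int trials, Supplier<T> computation) {
        T first = computation.get();
        for (int i = 1; i < trials; i++) {
            if (!Objects.equals(first, computation.get())) return false;
        }
        return true;
    }

    /** Stand-in workload: each thread bumps a shared counter. */
    static int parallelCount(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) counter.incrementAndGet();
            });
            pool[t].start();
        }
        for (Thread th : pool) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(sameEveryTrial(5, () -> parallelCount(20, 1000)));
    }
}
```

A race condition in the workload would tend to show up as a trial
whose result differs from the first, although a passing run cannot
prove that no race exists.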

Furthermore, the Fibonacci heap needed thorough testing if we were going to
boldly replace Java's venerable PriorityQueue class with our own hack. Our
best idea was to try constructing multiple random heaps and perform our
own random set of operations on them, checking the heap invariant after every
one. This test proved to be quite effective. Many hidden bugs lurking within
the heap implementation were swiftly and surely brought to light by this
test. As a result, we gained some confidence that our own heap solution
was worthy enough to contend with Sun's (wishful thinking! :).
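
The randomized heap test can be illustrated with the sketch below. A
tiny binary min-heap stands in for our Fibonacci heap, since the point
here is the test loop rather than the heap itself. The cross-check
against java.util.PriorityQueue is an extra added for this example,
and all class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Illustrative randomized heap test; the heap is a simple stand-in. */
final class HeapFuzzSketch {
    static final class MinHeap {
        private final List<Integer> a = new ArrayList<>();

        void insert(int x) {
            a.add(x);
            int i = a.size() - 1;
            while (i > 0 && a.get((i - 1) / 2) > a.get(i)) {   // sift up
                swap(i, (i - 1) / 2);
                i = (i - 1) / 2;
            }
        }

        int extractMin() {
            int min = a.get(0);
            int last = a.remove(a.size() - 1);
            if (!a.isEmpty()) {
                a.set(0, last);
                int i = 0;
                while (true) {                                  // sift down
                    int l = 2 * i + 1, r = l + 1, s = i;
                    if (l < a.size() && a.get(l) < a.get(s)) s = l;
                    if (r < a.size() && a.get(r) < a.get(s)) s = r;
                    if (s == i) break;
                    swap(i, s);
                    i = s;
                }
            }
            return min;
        }

        int size() { return a.size(); }

        /** Heap invariant: no parent is larger than its child. */
        boolean invariantHolds() {
            for (int i = 1; i < a.size(); i++) {
                if (a.get((i - 1) / 2) > a.get(i)) return false;
            }
            return true;
        }

        private void swap(int i, int j) {
            Integer t = a.get(i); a.set(i, a.get(j)); a.set(j, t);
        }
    }

    /** Applies random operations, checking the invariant after each one. */
    static boolean fuzz(long seed, int operations) {
        Random rnd = new Random(seed);
        MinHeap heap = new MinHeap();
        PriorityQueue<Integer> reference = new PriorityQueue<>();
        for (int op = 0; op < operations; op++) {
            if (heap.size() == 0 || rnd.nextInt(3) != 0) {
                int x = rnd.nextInt(1000);
                heap.insert(x);
                reference.add(x);
            } else if (heap.extractMin() != reference.poll()) {
                return false;                  // disagreement with reference
            }
            if (!heap.invariantHolds()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fuzz(42L, 10000));
    }
}
```

Using a fixed seed keeps a failing run reproducible, which makes the
offending sequence of operations much easier to track down.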

Mengxiang wrote many of the rote validity tests in the code. They test the
methods for correctness and fault tolerance.

-====================================================================-
 - CONCLUSION AND FUTURE WORK -
-====================================================================-

Time is a scarce resource at Cornell. Some of our most ambitious ideas did
not make it into the final product, but that can be said about many project
life cycles in the real world. We had thought of writing a 3D OpenGL tree
visualization tool for the GUI, but we were one day short of actually
including it in our project. JOGL would have facilitated this, along with
prior experience with OpenGL in other projects.

Also, we thought Birman's cloud/distributed computing material was pretty
neat and wondered if we could somehow dispatch our threads to other machines
using Java's web services functionality. Unfortunately, multithreading alone
proved to be ambitious enough, and we were not able to implement this, but
hey, it's still a pretty cool idea to crunch large DNA data sets "in the
cloud", much like how protein folding is being carried out nowadays.

Overall, we thought the project was pretty successful. Our greatest triumph
was hands down the multithreading, but all in all, the rest of the project
went just as smoothly and we seemed to work quite nicely toward
accomplishing our goals here even if it meant being overly ambitious at times!

    ("`-''-/").___..--''"`-._
     `6_ 6 )    `-. (      ).`-.__.`)
     (_Y_.)' ._    ) `._ `. ``-..-'
   _..`--'_..-_/ /--'_.' ,'
  (il).-'' (li).' ((!.-'

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
