Performance Analysis of Leading Application Lifecycle
    Management Systems for Large Customer Data Environments

                                           Paul Nelson
                  Director, Enterprise Systems Management, AppliedTrust, Inc.
                                      paul@appliedtrust.com

                                       Dr. Evi Nemeth
        Associate Professor Attendant Rank Emeritus, University of Colorado at Boulder
                          Distinguished Engineer, AppliedTrust, Inc.
                                    evi@appliedtrust.com

                                           Tyler Bell
                                   Engineer, AppliedTrust, Inc.
                                     tyler@appliedtrust.com

                                      AppliedTrust, Inc.
                               1033 Walnut St, Boulder, CO 80302
                                        (303) 245-4545




                                            Abstract
The performance of three leading application lifecycle management (ALM) systems (Rally by
Rally Software, VersionOne by VersionOne, and JIRA+GreenHopper by Atlassian) was
assessed to draw comparative performance observations when customer data exceeds a 500,000-
artifact threshold. The focus of this performance testing was how each system handles a
simulated “large” customer (i.e., a customer with half a million artifacts). A near-identical
representative data set of 512,000 objects was constructed and populated in each system in order
to simulate identical use cases as closely as possible. Timed browser testing was performed to
gauge the performance of common usage scenarios, and comparisons were then made. Nine tests
were performed based on measurable, single-operation events.
         Rally emerged as the strongest performer based on the test results, leading outright in six of the nine tests that were compared. In one of those six, Rally tied with VersionOne under the scoring system developed for these comparisons, though it led on raw measured speed. In one test outside the six, Rally tied with JIRA+GreenHopper numerically and within the bounds of the scoring model. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests but, as noted, tied with Rally for one. JIRA+GreenHopper was almost an order of magnitude slower than its peers on any test that involved its agile software development plug-in. All applications were able to complete the tests performed (i.e., no tests failed outright). Based on
on the results, Rally and VersionOne, but not JIRA+GreenHopper, appear to be viable solutions
for clients with a large number of artifacts.

1. Introduction

As the adoption of agile project management has accelerated over the last decade, so too has the use of tools supporting this methodology. This growth has resulted in the accumulation of artifacts (user stories, defects, tasks, and test cases) by customers in their ALM system of choice. The trend is for data stored in these systems to be retained indefinitely, as there is no compelling reason to remove it, and often, product generations are developed and improved over significant periods of time. In other cases, the size of specific customers and ongoing projects may result in very rapid accumulation of artifacts in relatively short periods of time. Anecdotal reports suggest that an artifact threshold exists around the 500,000-artifact point, and this paper seeks to test that observation.
    This artifact scaling presents a challenge for ALM solution providers, as customers expect performance consistency in their ALM platform regardless of the volume of the underlying data. While it is certainly possible to architect ALM systems to address such challenges, there are anecdotal reports that some major platforms do not currently handle large projects in a sufficient manner from a performance perspective.
    This paper presents the results of testing performed in August and September 2012, recording the performance of Rally Software, VersionOne, and JIRA+GreenHopper, and then drawing comparative conclusions between the three products. Atlassian’s ALM offering utilizes its JIRA product and extends it to support agile project management using the GreenHopper functionality extension (referred to in this paper as JIRA+GreenHopper). Rally Build 7396, VersionOne 12.2.2.3601, and JIRA+GreenHopper (JIRA 5.1 with GreenHopper 6) were the versions tested.
    The tests measure the performance of single-user, single-operation events when an underlying customer data set made up of 500,000 objects is present. These tests are not intended to be used to draw conclusions regarding other possible scenarios of interest, such as load, concurrent users, or other tests not explicitly described.
    The fundamental objective of the testing is to provide some level of quantitative comparison for user-based interaction with the three products, as opposed to system- or service-based interaction.

2. Data Set Construction

The use of ALM software and the variety of artifacts, custom fields, etc., will vary significantly between customers. As a result, there is not necessarily a “right way” to structure data for test purposes. More important is that fields contain content that is similarly structured to real data (e.g., text in freeform text fields, dates in date fields), and that each platform is populated with the same data. In some cases, product variations prevented this. Rally, for example, does not use the concept of an epic, but rather a hierarchical user story relationship, whereas VersionOne supports epics.
    Actually creating unique content for every artifact would be infeasible for testing purposes. To model real data, a structure was chosen for a customer instance based on 10 unique projects. Within each project, 40 epics or parent user stories were populated, and 80 user stories were created within each of those. Associated with each user story were 16 artifacts: 10 tasks, four defects, and two test cases.
In terms of core artifact types, the product of these counts is 16*80*40*10, or 512,000. All platforms suffered from difficulties related to data population. This manifested in a variety of ways, including imports “freezing,” data being truncated, or data being mismapped to incorrect fields. Every effort was made to ensure as much data consistency between data uploads as possible, but there were slight deviations from the expected norm. This was estimated to be no more than 5%, and where there was missing data, supplementary uploads were performed to move the total artifact count closer to the 512,000 target. In addition, tests were only performed on objects that met consistency checks (i.e., the same field data).
    These symmetrical project data structures are not likely to be seen in real customer environments. The numbers of parent objects and child objects will also vary considerably. That being said, a standard form is required to allow population in three products and to enable attempts at some level of data consistency. Given that the structure is mirrored as closely as possible across each product, the performance variance should be indicative of observed behaviors in other customer environments regardless of the exact artifact distributions.
    Custom fields are offered by all products, and so a number of fields were added and populated to simulate their use. Five custom fields were added to each story, task, defect, and test case; one was Boolean true/false, two were numerical values, and two were short text fields.
    The data populated followed the schema specified by each vendor’s documentation. We populated fields for ID, name, description, priority, and estimated cost and time to complete. The data consisted of dates and times, values from fixed lists (e.g., the priority field with each possible value used in turn), references to other objects (parent ID), and text generated by a lorem ipsum generator. This generator produces text containing real sentence and paragraph structures, but random strings as words. A number of paragraph size and content blocks were created, and their use was repeated in multiple objects. The description field of a story contained one or two paragraphs of this generated text. Tasks, defects, and tests used one or two sentences. If one story got two paragraphs, then the next story would get one paragraph, and so on in rotation. This data model was used for each system.
    It is possible that one or more of the products may be able to optimize content retrieval with an effective indexing strategy, but this advantage is implementable in each product. Only JIRA+GreenHopper prompted the user to initiate indexing operations, and based on that prompted instruction, indexing was performed after data uploads were complete.

3. Data Population

Data was populated primarily by using the CSV import functionality offered by each system. This process varied in the operation sequence and chunking mechanism for uploads, but fundamentally was based on tailoring input files to match the input specifications and uploading a sequence of files. Out of necessity, files were uploaded in various-sized pieces related to input limits for each system. API calls and scripts were used to establish relationships between artifacts when the CSV input method did not support or retain these relationships. We encountered issues with each vendor’s product in importing such a large data set, which suggests that customers considering switching from one product to another should look carefully at the feasibility of loading their existing data. Some of our difficulty in loading data involved the fact that we wanted to measure comparable operations, and the underlying data structures made this sometimes easy, sometimes nearly impossible.
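    To make the shape of this data set concrete, the following sketch generates user story rows with the structure described above (10 projects, 40 epics per project, 80 user stories per epic, and descriptions that rotate between one and two lorem-ipsum paragraphs) and writes them out as fixed-size CSV chunks. The field names and the 1,000-row chunk size are illustrative assumptions rather than any vendor's actual import schema; the real uploads were tailored to each product's CSV specification.

    import csv
    import itertools

    PROJECTS, EPICS_PER_PROJECT, STORIES_PER_EPIC = 10, 40, 80
    CHILDREN_PER_STORY = 16                       # 10 tasks + 4 defects + 2 test cases
    LOREM = ["Lorem ipsum dolor sit amet ...", "Consectetur adipiscing elit ..."]

    # 16 * 80 * 40 * 10 = 512,000 child artifacts, as described in Section 2.
    TOTAL_CHILD_ARTIFACTS = CHILDREN_PER_STORY * STORIES_PER_EPIC * EPICS_PER_PROJECT * PROJECTS

    def story_rows():
        """Yield one dictionary per user story, rotating between two- and one-paragraph descriptions."""
        paragraphs = itertools.cycle([2, 1])
        for p in range(PROJECTS):
            for e in range(EPICS_PER_PROJECT):
                for s in range(STORIES_PER_EPIC):
                    story_id = f"P{p:02d}-E{e:02d}-S{s:02d}"
                    yield {
                        "ID": story_id,
                        "Name": f"Story {story_id}",
                        "Description": " ".join(LOREM[:next(paragraphs)]),
                        "Priority": "Medium",
                        "Parent": f"P{p:02d}-E{e:02d}",   # epic / parent story reference
                    }

    def write_chunks(rows, prefix, chunk_size=1000):
        """Write rows into numbered CSV files of at most chunk_size data lines each."""
        rows = iter(rows)
        for chunk_no in itertools.count():
            chunk = list(itertools.islice(rows, chunk_size))
            if not chunk:
                break
            with open(f"{prefix}_{chunk_no:04d}.csv", "w", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=list(chunk[0]))
                writer.writeheader()
                writer.writerows(chunk)

    if __name__ == "__main__":
        write_chunks(story_rows(), "stories")     # 32,000 stories -> 32 chunk files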

4. JIRA+GreenHopper Data Population Issues

We had to create a ‘Test Case’ issue type in the JIRA+GreenHopper product and use what is known in the JIRA+GreenHopper community as a bug to keep track of the parent-child hierarchy of data objects. Once this was done, the data loaded quite smoothly using CSV files and the product's import facility until we reached the halfway point, when the import process slowed down considerably. Ultimately, the data import took two to three full days to complete.

5. Rally Data Population Issues

Rally limits the size of CSV files to 1,000 lines and 2.097 MB. It also destroys the UserStory/SubStory hierarchy on import (though it presents the hierarchy on export). These limitations led to a lengthy and tedious data population operation. Tasks could not be imported using the CSV technique; instead, scripting was used to import tasks via Rally's REST API. The script was built using Pyral, a library released by Rally for quick, easy access to its API from the Python scripting language. The total data import process took about a week to complete.
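    A minimal sketch of this Pyral-based task import follows. The server, credentials, field mapping, and story lookup are illustrative assumptions rather than the exact script used for this effort, and the precise Pyral calls may vary between library versions.

    # Sketch only: assumes the pyral package is installed and the endpoint and credentials are valid.
    from pyral import Rally

    rally = Rally("rally1.rallydev.com", user="user@example.com", password="secret",
                  workspace="Test Workspace", project="Project 01")

    def add_task(story, name, description, estimate_hours=2):
        """Create one Task attached to an existing user story through the REST API."""
        task_data = {
            "Name": name,
            "Description": description,
            "Estimate": estimate_hours,
            "WorkProduct": story.ref,             # parent artifact reference (assumed field mapping)
        }
        return rally.create("Task", task_data)

    # Look up a parent story by its formatted ID, then attach ten generated tasks to it.
    response = rally.get("HierarchicalRequirement", fetch="FormattedID,Name",
                         query='FormattedID = "US1234"')
    for story in response:
        for i in range(10):
            add_task(story, f"Task {i + 1}", "Generated task description.")
        break                                     # only the single matching story is needed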

6. VersionOne Data Population Issues

VersionOne did not limit the CSV file size, but warned that importing more than 500 objects at a time could cause performance issues. This warning was absolutely true. During import, our VersionOne test system was totally unresponsive to user operations. CSV files of 5,000 lines would lock it up for hours, making data population take over a week of 24-hour days.

7. Testing Methodology

A single test system was used to collect test data in order to limit bias introduced by different computers and browser instances. The test platform was a Dell Studio XPS 8100 running Microsoft Windows 7 Professional SP1 64-bit, and the browser used to perform testing was Mozilla Firefox v15.0.1. The Firebug add-on, v1.10.3, was used to collect test metrics. Timing data was recorded in a data collection spreadsheet constructed for this project. While results are expected to vary with other software and version combinations, using a standardized collection model ensured a consistent, unbiased approach to gathering test data for this paper and allows legitimate comparisons to be made. It is expected that while the actual timing averages may differ, the comparisons will not.
    At the time measurements were taken, the measurement machine was the only user of our instance of each software product. All tests were performed using the same network and Internet connection, with no software updates or changes between tests. To ensure there were no large disparities between response times, an http-ping utility was used to measure roundtrip response times to the service URLs provided by each system. Averaged response times over 10 http-ping samples were all under 350 milliseconds and within 150 milliseconds of each other, suggesting connectivity and response are comparable for all systems: JIRA+GreenHopper had an average response time of 194 milliseconds, Rally 266, and VersionOne 343. All tests were performed during US MDT business hours (8 a.m. – 5:30 p.m.).
    It is noted that running tests in a linear manner does introduce the possibility of performance variation due to connectivity differences between endpoints, though these variations would be expected under any end-user usage scenario and are difficult, if not impossible, to predict and measure.
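    The roundtrip comparison described above can be approximated with the standard-library sketch below. It is a rough stand-in for the dedicated http-ping utility used during testing, and the URLs are placeholders rather than the actual service URLs of the instances under test.

    import statistics
    import time
    import urllib.request

    SERVICE_URLS = {                              # placeholder endpoints, not the instances under test
        "Rally": "https://alm-vendor-a.example.com/",
        "VersionOne": "https://alm-vendor-b.example.com/",
        "JIRA+GreenHopper": "https://alm-vendor-c.example.com/",
    }

    def average_roundtrip_ms(url, samples=10):
        """Average wall-clock time, in milliseconds, to open a connection and read one byte."""
        timings = []
        for _ in range(samples):
            start = time.perf_counter()
            with urllib.request.urlopen(url, timeout=30) as response:
                response.read(1)
            timings.append((time.perf_counter() - start) * 1000.0)
        return statistics.mean(timings)

    for name, url in SERVICE_URLS.items():
        print(f"{name}: {average_roundtrip_ms(url):.0f} ms average over 10 samples")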


    Tests and data constructs were implemented in a manner to allow apples-to-apples comparison, with as little bias and potential benefit to any product as possible. However, it should be noted that these are three different platforms, each with unique features. In the case where a feature exists on only one or two of the platforms, that element was not tested. The focus was on the collection of core tests described in the test definition table in the next section.
    The time elapsed from the start of the first request until the end of the last request/response was used as the core time metric associated with a requested page load when possible. This data is captured with Firebug, and an example is illustrated below for a VersionOne test.




                          Example of timing data collection for a VersionOne test.


    We encountered challenges timing pages that perform operations using asynchronous techniques to update or render data. Since we are interested in when the results of operations are visible to the user, timing only the asynchronous call that initiates the request provides little value from a testing perspective. In cases where no single timing event could be used, timing was performed manually. This increased the error associated with the measurement, and this error is estimated to be roughly one second or less. Cases where manual measurements were made are indicated in the result analysis. A stopwatch with 0.1-second granularity was used for all manually timed tests, as were two people: one running the test with start/stop instruction and the other timing from those verbal cues.
    It is acknowledged that, regardless of the constraints imposed here to standardize data and tests for comparison purposes, there may be deviations from performance norms due to the use of simulated data, either efficiencies or inefficiencies. Bias may also be introduced in one or more products based on the testing methodology employed. While every effort was made to make tests fair and representative of legitimate use cases, it is recognized that results might vary if a different data set were used. Further, the testing has no control over localized performance issues affecting the hosted environments from which the services are provided. If testing results in minor variance between products, then arguably some of this variance could be due to factors outside of the actual application.
    The enterprise trial versions were used to test each system. We have no data regarding how each service handles trial instances; it is possible that the trial instances differ from paid subscription instances, but based on our review and the trial process, there was no indication the trial version was in any way different. We assume that providers would not intentionally offer a performance-restricted instance for trial customers, given that their end goal would be to convert those trial customers to paying subscribers.

    Based on a per-instance calibration routine, the decision was made to repeat each test 10 times per platform. A comparison between a 10-test and a 50-test sample was performed for one test case (user story edit) per platform to ensure the standard deviation between respective tests was similar enough to warrant the use of a 10-test sample. In no case was the calibration standard deviation greater than one second. If the performance differences between applications are found to be of a similar order of magnitude (i.e., seconds), then the use of a 10-test sample per application should clearly be questioned. However, if the overriding observation is that each application performs within the same small performance range as the others, the nuances of sample size calculation are rendered insignificant.
    A more in-depth sample sizing exercise could also be performed, and could realistically be performed per test. However, it is already recognized that there are numerous factors beyond the control of the tests, to the extent that further increasing the sample size would offer little value given the relatively consistent performance observed during calibration.
    To reduce as many bandwidth and geographic distance factors as possible, the client browser cache was not cleared between tests. This also better reflects real user interaction with the systems. A single pretest run for every test was performed to allow client-side object caching; in fact, each test was executed 11 times, but only results 2-11 were analyzed. Based on the belief that the total artifact count is the root cause of scalability issues, allowing caching should eliminate some of the variation due to factors that cannot be controlled by the test.
    The use of attachments was not tested. This was identified as more of a bandwidth and load test, as opposed to a test of system performance in a scalability scenario.

8. Test Descriptions

Tests were constructed based on common uses of ALM systems. Timing data was separated into discrete operations when sequences of events were tested. These timings were compared individually, as opposed to in aggregate, in order to account for interface and workflow differences between products.
    There may be tests and scenarios that could be of interest but were not captured, either because they were not reproducible in all products or because they were not identified as common operations. Also, it would be desirable in future tests to review the performance of logical relationships (complex links between iterations/sprints and other artifacts, for example). The core objective when selecting these tests was to enable comparison for similar operations between systems.


#   Test Name                 Description/Purpose

1   Refresh the backlog       The backlog page is important to both developers and managers; it
    for a single project.     is the heart of the systems. Based on variance in accessing the
                              backlog, the most reliable mechanism to test was identified as a
                              refresh of the backlog page. Views were configured to display 50
                              entries per page.

2   Switch backlog views      A developer working on two or more projects might frequently
    between two projects.     swap projects. Views were configured to display 50 entries per
                              page.

3    Paging through            With our large data sets, navigation of large tables can become a
     backlog lists.            performance issue. Views were configured to display 50 entries per
                               page.

4    Select and view a story    Basic access to a story.
     from the backlog.

5    Select and view a task. Basic access to a task.

6    Select and view a         Basic access to a defect or bug. (Note: JIRA+GreenHopper uses
     defect/bug.               the term bug, while Rally and VersionOne use defect.)

7    Select and view a test.   Basic access to a test case.

8    Create an                 Common management chore. (Note: This had to be manually timed
     iteration/sprint.         for JIRA+GreenHopper, as measured time was about 0.3 seconds
                               while elapsed time was 17 seconds.)

9    Move a story to an        Common developer or manager chore. (Note: JIRA+GreenHopper
     iteration/sprint.         and VersionOne use the term sprint, while Rally uses iteration.)

10   Convert a story to a       Common developer chore. (Note: This operation is not applicable
     defect/bug.                to Rally because of the inherent hierarchy between a story and its
                                defects.)

9. Test Results

Each test was performed 1+10 times in sequence for each software system, and the mean and standard deviation were computed. The point estimates were then compared to find the fastest performing application. A +n (seconds) indicator was used to express the relative performance lag of the other applications behind the fastest performing application for that test.
    The test result summary table illustrates the relative performance for each test to allow observable comparisons per product and per test. In order to provide a measurement-based comparison, a scale was created to allow numerical comparison between products. There were no cases where the leader in a test performed badly (subjectively). As such, the leader in a test is given the “Very Good” rating, which corresponds to five points. The leading time is then used as a base for comparative scoring of competitors for that test, with each competitor's score based on how many multiples its time was of the fastest performer. The point legend table is illustrated below.

         Time Multiple         Points
      1.0x ≤ time < 1.5x         5
      1.5x ≤ time < 2.5x         4
      2.5x ≤ time < 3.5x         3
      3.5x ≤ time < 4.5x         2
         4.5x ≤ time             1
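    The legend above can be encoded directly; the assertion at the bottom applies it to the Test 1 point estimates reported later in this section.

    def points(mean_time, fastest_time):
        """Map a mean response time to 1-5 points using the time-multiple legend above."""
        multiple = mean_time / fastest_time
        if multiple < 1.5:
            return 5    # Very Good
        if multiple < 2.5:
            return 4    # Good
        if multiple < 3.5:
            return 3    # Acceptable
        if multiple < 4.5:
            return 2    # Poor
        return 1        # Very Poor

    # Test 1 example: VersionOne led at 3.14 s, so Rally's 5.53 s is a 1.76x multiple, worth 4 points.
    assert points(5.53, 3.14) == 4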




Test Result Summary Table (Relative Performance Analysis)
     Legend: Very Good (5)   Good (4)   Acceptable (3)   Poor (2)   Very Poor (1)

     Tests: 1 Backlog Refresh, 2 Switch Backlog, 3 Backlog Paging, 4 View Story, 5 View Task,
     6 View Defect, 7 View Test, 8 Create Sprint, 9 Story → Sprint

     System and Test Summary      Overall Rating (Out of 45)
     Rally                        43
     VersionOne                   32
     JIRA+GreenHopper             18

     (Per-test ratings are detailed in the individual test results below.)

    It must be noted that the resulting means are point-estimate averages. For several reasons, we do not suggest or use confidence intervals or test for significance: given the challenges associated with structuring common tests across different interfaces and different data structures, and with no guarantee of connection quality, it is extraordinarily difficult to do so. In addition, because each test may have a different weight or relevance to each customer depending on their ALM process, the relevance of a test leader should be weighted according to the preference of the reader. That being said, these tests are intended to reflect the user experience. To address some of the concerns associated with point estimates, analysis of high and low bounds based on one and two standard deviations was performed. If the high bound for the fastest result overlaps with the low bound for either of the slower performing applications, the significance of the performance gain between those comparisons is questionable. The overlap suggests there will be cases where the slower (overlapping) application may perform faster than the application with the fastest response time.
    Statistical theory and the three-sigma rule suggest that when data is normally distributed (symmetrically distributed), roughly 68% of observations should lie within one standard deviation of the mean, and 95% should lie within two standard deviations. We graphically tested for normality using our calibration data and observed our data to be normally distributed. When there is no overlap between timings at two standard deviations, it will be fairly rare for one of the typically slower performing applications to exceed the performance of the faster application (for that particular test). If there is no overlap at one or two standard deviations between the lower and upper bounds, the result is marked as “Significant.” If there is overlap in one or both cases, that result is flagged as “Insignificant.” Significance is assessed between the fastest performing application for the test and each of the other two applications; therefore, the significance analysis is only populated for the application with the fastest point estimate. The advantage is classed as insignificant if the closest performing peer implies the result is insignificant. All data values are in seconds.
    Results from each test are analyzed separately below. The results of each test are shown both in table form with values and in bar graph form, and are also interpreted in the text below the corresponding table. Note that long bars in the comparison graphs are long response times, and therefore bad.
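    The bound-overlap analysis can be summarized in a few lines of Python; the sketch below also reflects the 1+10 run protocol, with the cache-priming pretest run discarded before the mean and standard deviation are computed. The worked example uses the Test 1 figures that follow.

    from statistics import mean, stdev

    def summarize(timings_with_pretest):
        """Drop the cache-priming pretest run, then return (mean, sample standard deviation)."""
        timings = timings_with_pretest[1:]        # results 2-11 only
        return mean(timings), stdev(timings)

    def bounds(m, s, k):
        """Low and high bounds at k standard deviations around the mean."""
        return m - k * s, m + k * s

    def classify(fastest, slower, k):
        """'Significant' when the fastest system's high bound stays below the slower system's low bound."""
        fast_mean, fast_sd = fastest
        slow_mean, slow_sd = slower
        _, fast_high = bounds(fast_mean, fast_sd, k)
        slow_low, _ = bounds(slow_mean, slow_sd, k)
        return "Significant" if fast_high < slow_low else "Insignificant"

    # Worked example with the Test 1 point estimates (seconds): VersionOne versus Rally at two standard deviations.
    versionone = (3.14, 0.25)
    rally = (5.53, 0.29)
    print(classify(versionone, rally, k=2))       # no overlap at 2 SD, so the result is "Significant"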

Test 1: Refresh Backlog Page for a Single Project

  System         Mean    Standard     Point      1 SD                   1 SD           2 SD      2 SD
                Request Deviation   Estimate    Range                  Overlap        Range     Overlap
                 Time    (seconds) Comparison (seconds)                Analysis     (seconds)   Analysis
               (seconds)            (seconds)

   JIRA+         15.27        1.38         +12.13           13.89 –        -         12.52 –         -
GreenHopper                                                  16.64                    18.02

   Rally          5.53        0.29          +2.39           5.24 –         -         4.95 –          -
                                                             5.81                     6.10

VersionOne        3.14        0.25         Fastest          2.88 –    Significant    2.63 –     Significant
                                                             3.39                     3.64




Interpretation: The data indicates that for this particular task, even when accounting for variance in performance, VersionOne performs fastest. Note that the advantage is relatively small when compared to Rally, though the Rally point estimate does lag by almost 2.4 seconds. Both VersionOne and Rally perform significantly better than JIRA+GreenHopper when executing this operation.

Best Performer: VersionOne




Test 2: Switch Backlog Views Between Two Projects

  System            Mean          Standard     Point                  1 SD         1 SD             2 SD          2 SD
                   Request        Deviation  Estimate                Range        Overlap          Range         Overlap
                    Time          (seconds) Comparison             (seconds)      Analysis       (seconds)       Analysis
                  (seconds)                  (seconds)

   JIRA+             13.84           0.83           +11.39           13.01 –           -           12.19 –              -
GreenHopper                                                           14.66                         15.49

   Rally              2.45           0.16           Fastest          2.29 –      Significant       2.13 –       Significant
                                                                      2.60                          2.76

VersionOne            2.94           0.07           +0.49            2.87 –            -           2.79 –               -
                                                                      3.01                          3.08




    *To perform this operation in JIRA+GreenHopper, the user must navigate between two scrumboards and then load the data. Therefore, the timing numbers for JIRA+GreenHopper are the sum of two measurements. This introduces request overhead not present in the tests of the other two products, yet the disparity suggests more than simple transaction overhead is the cause of the delay. Furthermore, the resulting page was frozen once rendered and was not usable for an additional 10 – 15 seconds. Users would likely factor that additional delay into their impression of the user experience, but it was not included in the timing here.

Interpretation: The data indicates that Rally and VersionOne are significantly faster than JIRA+GreenHopper, even when considering the sum of two operations. Rally is faster than VersionOne, though marginally so. In terms of user interaction, the experience would be similar for the two products.

Best Performer: Rally

Test 3: Paging Through Backlog List

  System         Mean       Standard     Point      1 SD                    1 SD             2 SD       2 SD
                Request     Deviation  Estimate    Range                   Overlap          Range      Overlap
                 Time       (seconds) Comparison (seconds)                 Analysis       (seconds)    Analysis
               (seconds)               (seconds)

   JIRA+          1.53          0.66         Fastest           0.87 –     Insignificant    0.21 –     Insignificant
GreenHopper                                                     2.19                        2.85

   Rally          1.93          0.11          +0.4             1.81 –           -          1.70 –            -
                                                                2.04                        2.15

VersionOne        3.45          0.29         +1.92             3.16 –           -          2.87 –            -
                                                                3.74                        4.04




Interpretation: JIRA+GreenHopper had the fastest point-estimate mean, but the analysis suggests the improvement over Rally, which was second-fastest, is minimal and not significant. The standard deviations suggest a wider performance variance for JIRA+GreenHopper, and so while its point estimate is better, the overall performance is likely to be comparable. The data indicates that VersionOne is significantly slower than the other two systems, and for very large data sets like those used in these tests, this makes scrolling through the data quite tedious.

Best Performer: JIRA+GreenHopper and Rally


Test 4: Selecting and Viewing a User Story From the Backlog

  System           Mean       Standard     Point                   1 SD        1 SD            2 SD        2 SD
                  Request     Deviation  Estimate                 Range       Overlap         Range       Overlap
                   Time       (seconds) Comparison              (seconds)     Analysis      (seconds)     Analysis
                 (seconds)               (seconds)

   JIRA+           3.49          0.99          +2.95              2.49 –          -          1.50 –            -
GreenHopper                                                        4.48                       5.47

   Rally           0.53          0.07         Fastest             0.46 –     Significant     0.40 –      Significant
                                                                   0.60                       0.67

VersionOne         1.90          0.30          +1.36              1.59 –          -          1.29 –            -
                                                                   2.20                       2.50




Interpretation: The data indicates that Rally is significantly faster than either JIRA+GreenHopper or VersionOne. While the result is significant, the one-second difference between Rally and VersionOne is not likely to have a significant impact on the user experience. Rally's performance is also more consistent than that of the other two products (i.e., it has a much lower response standard deviation).

Best Performer: Rally




Test 5: Selecting and Viewing a Task

  System           Mean        Standard     Point                   1 SD        1 SD           2 SD          2 SD
                  Request      Deviation  Estimate                 Range       Overlap        Range         Overlap
                   Time        (seconds) Comparison              (seconds)     Analysis     (seconds)       Analysis
                 (seconds)                (seconds)

   JIRA+            1.36          0.17         +0.92               1.20 –          -          1.03 –             -
GreenHopper                                                         1.53                       1.69

   Rally            0.44          0.03         Fastest             0.42 –     Significant     0.39 –        Significant
                                                                    0.47                       0.50

VersionOne          1.46          0.16         +1.01               1.29 –          -          1.13 –             -
                                                                    1.62                       1.78




Interpretation: The data indicates that Rally is significantly (in the probabilistic sense) faster than either JIRA+GreenHopper or VersionOne by about one second, and also has a more consistent response time (the lowest standard deviation). JIRA+GreenHopper and VersionOne showed similar performance. Overall, the result for all applications was qualitatively good.

Best Performer: Rally




Test 6: Selecting and Viewing a Test Case

  System          Mean       Standard     Point                  1 SD        1 SD           2 SD        2 SD
                 Request     Deviation  Estimate                Range       Overlap        Range       Overlap
                  Time       (seconds) Comparison             (seconds)     Analysis     (seconds)     Analysis
                (seconds)               (seconds)

   JIRA+           1.91         0.86         +1.37              1.05 –          -         0.19 –           -
GreenHopper                                                      2.77                      3.64

   Rally           0.54         0.13         Fastest            0.41 –     Significant    0.28 –     Insignificant
                                                                 0.67                      0.80

VersionOne         1.45         0.18         +0.91              1.27 –          -         1.09 –           -
                                                                 1.62                      1.80




Interpretation: The data indicates that, again, Rally is fastest in this task. The speed differences are significant at the one standard deviation level, where there is no overlap in the respective timing ranges, but not at two standard deviations. Rally performed with the lowest point estimate and the lowest variance, suggesting a consistently better experience. VersionOne was second in terms of performance, followed by JIRA+GreenHopper.

Best Performer: Rally




Test 7: Selecting and Viewing a Defect/Bug

  System          Mean       Standard     Point                  1 SD        1 SD           2 SD        2 SD
                 Request     Deviation  Estimate                Range       Overlap        Range       Overlap
                  Time       (seconds) Comparison             (seconds)     Analysis     (seconds)     Analysis
                (seconds)               (seconds)

   JIRA+           1.70         0.81          +1.02             0.88 –          -         0.07 –              -
GreenHopper                                                      2.51                      3.32

   Rally           0.68         0.05         Fastest            0.63 –     Significant    0.58 –     Insignificant
                                                                 0.72                      0.77

VersionOne         1.74         0.17          +1.06             1.56 –          -         1.39 –              -
                                                                 1.91                      2.08




Interpretation: The data indicates that Rally is faster by roughly one second based on the point-estimate mean when compared to the other two products, with the difference being significant at the one standard deviation level but not at two standard deviations. Variance in the results of the other products suggests they will perform similarly to Rally on some occasions, but not all. Rally's performance was relatively consistent, as indicated by its very low standard deviation. Though the point estimates of VersionOne and JIRA+GreenHopper are very close, VersionOne's performance is preferred because of its lower standard deviation. That being said, given that the point estimates are all below two seconds, there would be little to no perceptible difference between VersionOne and JIRA+GreenHopper from a user perspective.

Best Performer: Rally

Test 8: Add an Iteration/Sprint

  System            Mean         Standard     Point                  1 SD         1 SD            2 SD          2 SD
                   Request       Deviation  Estimate                Range        Overlap         Range         Overlap
                    Time         (seconds) Comparison             (seconds)      Analysis      (seconds)       Analysis
                  (seconds)                 (seconds)

   JIRA+             17.76          0.60           +17.72          17.16 –            -          16.56 –             -
GreenHopper                                                         18.36                         18.96

   Rally             0.04           0.00           Fastest          0.04 –      Significant      0.03 –       Significant
                                                                     0.05                         0.05

VersionOne           1.36           0.10           +1.32            1.25 –            -          1.15 –              -
                                                                     1.46                         1.57




    *Due to the disparity between Rally and JIRA+GreenHopper here, the graph appears to show no data for Rally. The graph resolution is simply insufficient to render the data clearly, given the large values generated by the JIRA+GreenHopper tests.
    **The JIRA+GreenHopper data was manually measured due to inconsistencies between the reported timing and content rendering. Based on the requests observed, the asynchronous page timings appeared to complete when the requests were submitted, while the eventual content updates and rendering were disconnected from the original request being tracked. While this increases the measurement error, it certainly would not account for a roughly 17-second disparity.

Interpretation: Rally is the fastest performer in this test, with the results being significant at both the one and two standard deviation levels. JIRA+GreenHopper is many times slower than both Rally and VersionOne.

Best Performer: Rally

Test 9: Move a Story to an Iteration/Sprint

  System            Mean         Standard     Point                     1 SD        1 SD             2 SD             2 SD
                   Request       Deviation  Estimate                   Range       Overlap          Range            Overlap
                    Time         (seconds) Comparison                (seconds)     Analysis       (seconds)          Analysis
                  (seconds)                 (seconds)

   JIRA+             9.80            6.88            +8.42             2.91 –           -          0.00* –                -
GreenHopper                                                            16.68                        23.56

   Rally             3.37            0.22            +1.99             3.15 –           -           2.94 –                -
                                                                        3.59                         3.80

VersionOne           1.38            0.36           Fastest            1.02 –     Significant       0.66 –          Insignificant
                                                                        1.74                         2.09
    *The standard deviation range suggested a negative value, which is, of course, impossible. Therefore, 0.00 is
    provided.




Interpretation: The data indicates that VersionOne is fastest for this operation. The insignificant overlap at two standard deviations in this test is a result of the enormous standard deviation of the JIRA+GreenHopper tests.

Best Performer: VersionOne




Test 10: Convert a Story to a Defect/Bug

  System           Mean         Standard          Point              1 SD        1 SD            2 SD       2 SD
                  Request       Deviation       Estimate            Range       Overlap         Range      Overlap
                   Time         (seconds)      Comparison         (seconds)     Analysis      (seconds)    Analysis
                 (seconds)                      (seconds)

   JIRA+            26.56          2.94           +24.87           23.62 –           -         20.68 –            -
GreenHopper                                                         29.50                       32.44

   Rally            1.69           0.25           Fastest           1.44 –     Significant     1.19 –     Significant
                                                                     1.94                       2.19

VersionOne          6.06           0.28            +4.36            5.77 –           -         5.49 –             -
                                                                     6.34                       6.62




    *JIRA+GreenHopper required manual timing. See the interpretation below for explanation.

Interpretation: This operation is an example of one in which the procedure in each system is completely different and perhaps not comparable in any reasonable way. In JIRA+GreenHopper, there are three operations involved (accessing the story, invoking the editor, and, after changing the type of issue, saving the changes and updating the database), and these had to be manually timed. In addition, the JIRA+GreenHopper page froze for about 10 seconds after the update while it changed the icon to the left of the new defect from a green story icon to a red defect icon. This extra 10 seconds was not included in the timing results, although perhaps it should have been. In Rally, defects sit hierarchically below stories as one of a story's attributes, so a story cannot be converted to a defect, though defects can be promoted to stories. That is what we measured for Rally's case. And finally, VersionOne has a menu option to do this task.

    The results, reported here just for interest and not defensible statistically, indicate that Rally is fastest at this class of operation, followed by VersionOne at roughly +4 seconds and JIRA+GreenHopper at roughly +25 seconds.

Best Performer: N/A – Informational observations only.

10. Conclusions

Our testing was by no means exhaustive, but it was thorough enough to build a reasonably sized result set to enable comparison between applications. It fundamentally aimed to assess the performance of testable elements that are consistent between applications. We tried to choose simple, small tests that mapped well between the three systems and could be measured programmatically as opposed to manually (and succeeded in most cases, though some manual timing was required).
    Rally was the strongest performer, leading outright in six of the nine tests that were compared. In one of those six, Rally tied with VersionOne under the scoring system developed for these comparisons, though it led on raw measured speed. In one test outside the six, Rally tied with JIRA+GreenHopper numerically and within the bounds of the scoring model. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests but, as noted, tied with Rally for one.
    With the exception of backlog paging, JIRA+GreenHopper trailed in tests that leveraged agile development tools such as the scrumboard, which JIRA+GreenHopper implements with the GreenHopper plug-in. The GreenHopper overlay/add-on seemed unable to handle the large data sets effectively. When we tried to include a test of viewing the backlog for all projects, we were able to do so for Rally and VersionOne, but the JIRA+GreenHopper instance queried for over 12 hours without rendering the scrumboard and merged project backlog. Some object view operations resulted in second-best performance for JIRA+GreenHopper, but with the exception of viewing tasks, the variance associated with those requests was extraordinarily high compared to Rally and VersionOne. That large variance will manifest to users as an inconsistent experience (in terms of response time) when performing the same operation.
    Anecdotally, the performance of VersionOne compared to Rally was significantly degraded when import activity was taking place, to the extent that VersionOne became effectively unusable during import operations. Further testing could be performed to identify whether this is limited to CSV imports or extends to programmatic API access as well. Given how many platforms utilize API access regularly, it would be interesting to explore this result further.
    Both Rally and VersionOne appear to provide a reasonable user experience that should satisfy customers in most cases when the applications are handling large data sets with over 500,000 artifacts. JIRA+GreenHopper is significantly disadvantaged from a performance perspective and seems less suitable for customers with large artifact counts or with aggressive growth expectations. Factors such as user concurrency, variations in sprint structure, and numerous others have the potential to skew results in either direction, and it is difficult to predict how specific use cases may affect performance. These tests do, however, provide a reasonable comparative baseline, suggesting Rally has a slight performance advantage in general, followed closely by VersionOne.


References

A variety of references were used to help build and execute a performance testing methodology that would allow a reasonable, statistically supported comparison of the performance of the three ALM systems. In addition to the documentation available at the websites for each product, the following resources were used:

“Agile software development.” Wikipedia. Accessed Sept. 28, 2012 from http://en.wikipedia.org/wiki/Agile_software_development.

Beedle, Mike, et al. “Manifesto for Agile Software Development.” Accessed Sept. 28, 2012 from http://agilemanifesto.org.

Hewitt, Joe, et al. “Firebug.” Add-ons for Firefox, Mozilla. Accessed Sept. 28, 2012 from http://addons.mozilla.org/en-us/firefox/addon/firebug.

Honza. “Firebug Net Panel Timings.” Software is Hard. Accessed Sept. 28, 2012 from http://www.softwareishard.com/blog/firebug/firebug-net-panel-timings.

Peter. “Top Agile and Scrum Tools – Which One Is Best?” Agile Scout. Accessed Sept. 28, 2012 from http://agilescout.com/best-agile-scrum-tools.




Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
API Integration
API IntegrationAPI Integration
API Integration
 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript Programs
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper
 
Computer aided design, computer aided manufacturing, computer aided engineering
Computer aided design, computer aided manufacturing, computer aided engineeringComputer aided design, computer aided manufacturing, computer aided engineering
Computer aided design, computer aided manufacturing, computer aided engineering
 
Adapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologiesAdapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologies
 
Hybrid Knowledge Bases for Real-Time Robotic Reasoning
Hybrid Knowledge Bases for Real-Time Robotic ReasoningHybrid Knowledge Bases for Real-Time Robotic Reasoning
Hybrid Knowledge Bases for Real-Time Robotic Reasoning
 
Solving big data challenges for enterprise application
Solving big data challenges for enterprise applicationSolving big data challenges for enterprise application
Solving big data challenges for enterprise application
 
Adapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologiesAdapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologies
 
Amplitude wave architecture - Test
Amplitude wave architecture - TestAmplitude wave architecture - Test
Amplitude wave architecture - Test
 
Implement Test Harness For Streaming Data Pipelines
Implement Test Harness For Streaming Data PipelinesImplement Test Harness For Streaming Data Pipelines
Implement Test Harness For Streaming Data Pipelines
 
redpill Mobile Case Study (Salvation Army)
redpill Mobile Case Study (Salvation Army)redpill Mobile Case Study (Salvation Army)
redpill Mobile Case Study (Salvation Army)
 
Performance testing : An Overview
Performance testing : An OverviewPerformance testing : An Overview
Performance testing : An Overview
 
Bug Triage: An Automated Process
Bug Triage: An Automated ProcessBug Triage: An Automated Process
Bug Triage: An Automated Process
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...
 
markfinleyResumeMarch2016
markfinleyResumeMarch2016markfinleyResumeMarch2016
markfinleyResumeMarch2016
 
Harnessing the Cloud for Performance Testing- Impetus White Paper
Harnessing the Cloud for Performance Testing- Impetus White PaperHarnessing the Cloud for Performance Testing- Impetus White Paper
Harnessing the Cloud for Performance Testing- Impetus White Paper
 
Benchmarking Techniques for Performance Analysis of Operating Systems and Pro...
Benchmarking Techniques for Performance Analysis of Operating Systems and Pro...Benchmarking Techniques for Performance Analysis of Operating Systems and Pro...
Benchmarking Techniques for Performance Analysis of Operating Systems and Pro...
 
chapter 7.docx
chapter 7.docxchapter 7.docx
chapter 7.docx
 

Performance Analysis of Leading Application Lifecycle Management Systems for Large Customer Data Environments

  • 1. Performance Analysis of Leading Application Lifecycle Management Systems for Large Customer Data Environments Paul Nelson Director, Enterprise Systems Management, AppliedTrust, Inc. paul@appliedtrust.com Dr. Evi Nemeth Associate Professor Attendant Rank Emeritus, University of Colorado at Boulder Distinguished Engineer, AppliedTrust, Inc. evi@appliedtrust.com Tyler Bell Engineer, AppliedTrust, Inc. tyler@appliedtrust.com AppliedTrust, Inc. 1033 Walnut St, Boulder, CO 80302 (303) 245-4545 Abstract The performance of three leading application lifecycle management (ALM) systems (Rally by Rally Software, VersionOne by VersionOne, and JIRA+GreenHopper by Atlassian) was assessed to draw comparative performance observations when customer data exceeds a 500,000- artifact threshold. The focus of this performance testing was how each system handles a simulated “large” customer (i.e., a customer with half a million artifacts). A near-identical representative data set of 512,000 objects was constructed and populated in each system in order to simulate identical use cases as closely as possible. Timed browser testing was performed to gauge the performance of common usage scenarios, and comparisons were then made. Nine tests were performed based on measurable, single-operation events. Rally emerged as the strongest performer based on the test results, leading outright in six of the nine that were compared. In one of these six tests, Rally tied with VersionOne from a scoring perspective in terms of relative performance (using the scoring system developed for comparisons), though it led from a raw measured-speed perspective. In one test not included in the six, Rally tied with JIRA+GreenHopper from a numeric perspective and within the bounds of the scoring model that was established. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests, but as noted, tied with Rally for one. JIRA+GreenHopper was almost an order of magnitude slower than peers when performing any test that involved its agile software development plug-in. All 1
  • 2. applications were able to complete the tests being performed (i.e., no tests failed outright). Based on the results, Rally and VersionOne, but not JIRA+GreenHopper, appear to be viable solutions for clients with a large number of artifacts. 1. Introduction JIRA+GreenHopper JIRA 5.1with GreenHopper 6 were the versions that were As the adoption of agile project management tested. has accelerated over the last decade, so too has The tests measure the performance of the use of tools supporting this methodology. single-user, single-operation events when an This growth has resulted in the accumulation underlying customer data set made up of of artifacts (user stories, defects, tasks, and test 500,000 objects is present. These tests are not cases) by customers in their ALM system of intended to be used to draw conclusions choice. The trend is for data stored in these regarding other possible scenarios of interest, systems to be retained indefinitely, as there is such as load, concurrent users, or other tests no compelling reason to remove it, and often, not explicitly described. product generations are developed and The fundamental objective of the testing is improved over significant periods of time. In to provide some level of quantitative other cases, the size of specific customers and comparison for user-based interaction with the ongoing projects may result in very rapid three products, as opposed to system- or accumulation of artifacts in relatively short service-based interaction. periods of time. Anecdotal reports suggest that an artifact threshold exists around the 500,000 artifact point, and this paper seeks to test that 2. Data Set Construction observation. This artifact scaling presents a challenge The use of ALM software and the variety of for ALM solution providers, as customers artifacts, custom fields, etc., will vary expect performance consistency in their ALM significantly between customers. As a result, platform regardless of the volume of the there is not necessarily a “right way” to underlying data. While it is certainly possible structure data for test purposes. More to architect ALM systems to address such important is that fields contain content that is challenges, there are anecdotal reports that similarly structured to real data (e.g., text in some major platforms do not currently handle freeform text fields, dates in date fields), and large projects in a sufficient manner from a that each platform is populated with the same performance perspective. data. In some cases, product variations This paper presents the results of testing prevented this. Rally, for example, does not performed in August and September 2012, use the concept of an epic, but rather a recording the performance of Rally Software, hierarchical, user story relationship, whereas VersionOne, and JIRA+GreenHopper, and VersionOne supports epics. then drawing comparative conclusions Actually creating data with unique content between the three products. Atlassian’s ALM for all artifacts would be infeasible for testing offering utilizes its JIRA product and extends purposes. To model real data, a structure was it to support agile project management using chosen for a customer instance based on 10 the GreenHopper functionality extension unique projects. Within each project, 40 epics (referred to in this paper as or parent user stories were populated, and 80 JIRA+GreenHopper). Rally Build 7396, user stories were created within each of those. 
VersionOne 12.2.2.3601, and Associated with each user story were 16 artifacts: 10 tasks, four defects, and two test 2
  • 3. cases. In terms of core artifact types, the This generator produces text containing real product of these counts is 16*80*40*10, or sentence and paragraph structures, but random 512,000. All platforms suffered from strings as words. A number of paragraph size difficulties related to data population. This and content blocks were created, and their use manifested in a variety of ways, including was repeated in multiple objects. The imports “freezing,” data being truncated, or description field of a story contained one or data being mismapped to incorrect fields. two paragraphs of this generated text. Tasks, Every effort was made to ensure as much data defects, and tests used one or two sentences. If consistency between data uploads as possible, one story got two paragraphs, then the next but there were slight deviations from the story would get one paragraph, and so on in expected norm. This was estimated to be no rotation. This data model was used for each more than 5%, and where there was missing system. data, supplementary uploads were performed It is possible that one or more of the to move the total artifact count closer to the products may be able to optimize content 512,000 target. In addition, tests were only retrieval with an effective indexing strategy, performed on objects that met consistency but this advantage is implementable in each checks (i.e., the same field data). product. Only JIRA+GreenHopper prompted These symmetrical project data structures the user to initiate indexing operations, and are not likely to be seen in real customer based on prompted instruction, indexing was environments. The numbers of parent objects performed after data uploads were complete. and child objects will also vary considerably. That being said, a standard form is required to allow population in three products and to 3. Data Population enable attempts at some level of data consistency. Given that the structure is Data was populated primarily by using the mirrored as closely as possible across each CSV import functionality offered by each product, the performance variance should be system. This process varied in the operation indicative of observed behaviors in other sequence and chunking mechanism for customer environments regardless of the exact uploads, but fundamentally was based on artifact distributions. tailoring input files to match the input Custom fields are offered by all products, specifications and uploading a sequence of and so a number of fields were added and files. Out of necessity, files were uploaded in populated to simulate their use. Five custom various-sized pieces related to input limits for fields were added to each story, task, defect, each system. API calls and scripts were used and test case; one was Boolean true/false, two to establish relationships between artifacts were numerical values, and two were short text when the CSV input method did not support or fields. retain these relationships. We encountered The data populated followed the schema issues with each vendor’s product in importing specified by each vendor’s documentation. We such a large data set, which suggests that populated fields for ID, name, description, customers considering switching from one priority, and estimated cost and time to product to another should look carefully at the complete. The data consisted of dates and feasibility of loading their existing data. 
Some times, values from fixed lists (e.g., the priority of our difficulty in loading data involved the field with each possible value used in turn), fact that we wanted to measure comparable references to other objects (parent ID), and operations, and the underlying data structures text generated by a lorem ipsum generator. made this sometimes easy, sometimes nearly impossible. 3
  • 4. 4. JIRA+GreenHopper Data Population 7. Testing Methodology Issues A single test system was used to collect test We had to create a ‘Test Case’ issue type in data in order to limit bias introduced by the JIRA+GreenHopper product and use what different computers and browser instances. is known in the JIRA+GreenHopper The test platform was a Dell Studio XPS 8100 community as a bug to keep track of the running Microsoft Windows 7 Professional parent-child hierarchy of data objects. Once SP1 64-bit, and the browser used to perform this was done, the data loaded quite smoothly testing was Mozilla Firefox v15.0.1. The using CSV files and its import facility until we Firebug add-on running v1.10.3 was used to reached the halfway point, when the import collect test metrics. Timing data was recorded process slowed down considerably. in a data collection spreadsheet constructed for Ultimately, the data import took two to three this project. While results are expected to vary full days to complete. if using other software and version combinations, using a standardized collection model ensured a consistent, unbiased approach 5. Rally Data Population Issues to gathering test data for this paper, and will allow legitimate comparisons to be made. It is Rally limits the size of CSV files to 1000 lines expected that while the actual timing averages and 2.097 MB. It also destroys the may differ, the comparisons will not. UserStory/SubStory hierarchy on import At the time measurements were being (though presents it on export). These taken, the measurement machine was the only limitations led to a lengthy and tedious data user of our instance of the software products. population operation. Tasks could not be All tests were performed using the same imported using the CSV technique. Instead, network and Internet connection, with no scripting was used to import tasks via Rally’s software updates or changes between tests. To REST API interface. The script was made ensure there were no large disparities between using Pyral, which is a library released by response times, an http-ping utility was used in Rally for quick, easy access to its API using order to measure roundtrip response times to the Python scripting language. The total data the service URLs provided by each system. import process took about a week to complete. Averaged response times over 10 http-ping samples were all under 350 milliseconds and within 150 milliseconds of each other, 6. VersionOne Data Population Issues suggesting connectivity and response are comparable for all systems. VersionOne did not limit the CSV file size, but JIRA+GreenHopper had an average response warned that importing more than 500 objects time of 194 milliseconds, Rally 266, and at a time could cause performance issues. This VersionOne 343. All tests were performed warning was absolutely true. During import, during US MDT business hours (8 a.m. – 5:30 our VersionOne test system was totally p.m.). unresponsive to user operations. CSV files of It is noted that running tests in a linear 5000 lines would lock it up for hours, making manner does introduce the possibility of data population take over a week of 24-hour performance variation due to connectivity days. performance variations between endpoints, though these variations would be expected under any end-user usage scenario and are 4
  • 5. difficult, if not impossible, to predict and was not tested. The focus was on the collection measure. of core tests described in the test definition Tests and data constructs were table in the next section. implemented in a manner to allow apples-to- The time elapsed from the start of the first apples comparison with as little bias and request until the end of the last potential benefit to any product as possible. request/response was used as the core time However, it should be noted that these are metric associated with a requested page load three different platforms, each with unique when possible. This data is captured with features. In the case where a feature exists on Firebug, and an example is illustrated below only one or two of the platforms, that element for a VersionOne test. Example of timing data collection for a VersionOne test. We encountered challenges timing pages inefficiencies. Bias may also be introduced in that perform operations using asynchronous one or more products based on the testing techniques to update or render data. Since we methodology employed. While every effort are interested in when the result of operations was made to make tests fair and representative are visible to the user, timing only the of legitimate use cases, it is recognized that asynchronous call that initiates the request results might vary if a different data set was provides little value from a testing perspective. used. Further, the testing has no control over In cases where no single time event could be localized performance issues affecting the used, timing was performed manually. This hosted environments from which the services increased the error associated with the are provided. If testing results in minor measurement, and this error is estimated to be variance between products, then arguably roughly one second or less. In cases where some of this variance could be due to factors manual measurements were made, it is outside of the actual application. indicated in the result analysis. A stopwatch The enterprise trial versions were used to with 0.1-second granularity was used for all test each system. We have no data regarding manually timed tests, as were two people — how each service handles trial instances; it is one running the test with start/stop instruction possible that the trial instances differ from and the other timing from those verbal cues. paid subscription instances, but based on our It is acknowledged that regardless of the review and the trial process, there was no constraints imposed here to standardize data indication the trial version was in any way and tests for comparison purposes, there may different. We assume that providers would not be deviations from performance norms due to intentionally offer a performance-restricted the use of simulated data, either efficiencies or instance for trial customers, given that their 5
  • 6. end goal would be to convert those trial run for every test was performed to allow customers to paying subscribers. object caching client-side — so in fact, each Based on a per-instance calibration routine, test was executed 11 times, but only results 2- the decision was made to repeat each test 10 11 were analyzed. Based on the belief that the times per platform. A comparison between a total artifact count is the root cause of 10-test and 50-test sample was performed for scalability issues, allowing caching should one test case (user story edit) per platform to eliminate some of the variation due to factors ensure the standard deviation between that cannot be controlled by the test. respective tests was similar enough to warrant The use of attachments was not tested. the use of a 10-test sample. In no case was the This was identified as more of a bandwidth calibration standard deviation greater than one and load test, as opposed to a performance of second. If the performance differences the system in a scalability scenario. between applications are found to be of a similar order of magnitude (i.e., seconds), then the use of a 10-test sample per application 8. Test Descriptions should clearly be questioned. However, if the overriding observation is that each application Tests were constructed based on common uses performs within the same small performance of ALM systems. Timing data was separated range of the others, the nuances of sample size into discrete operations when sequences of calculation are rendered insignificant. events were tested. These timings were A more in-depth sample sizing exercise compared individually, as opposed to in could also be performed, and could aggregate, in order to account for interface and realistically be performed per test. However, it workflow differences between products. is already recognized that there are numerous There may be tests and scenarios that factors beyond the control of the tests, to the could be of interest but were not captured, extent that further increasing sample size either because they were not reproducible in would offer little value given the relatively all products or were not identified as common consistent performance observed during operations. Also, it would be desirable in calibration. future tests to review the performance of To help reduce as many bandwidth and logical relationships (complex links between geographic distance factors as possible, the iterations/sprints and other artifacts, for client browser cache was not cleared between example). The core objective when selecting tests. This also better reflects real user these tests was to enable comparison for interaction with the systems. A single pretest similar operations between systems. # Test Name Description/Purpose 1 Refresh the backlog The backlog page is important to both developers and managers; it for a single project. is the heart of the systems. Based on variance in accessing the backlog, the most reliable mechanism to test was identified as a refresh of the backlog page. Views were configured to display 50 entries per page. 2 Switch backlog views A developer working on two or more projects might frequently between two projects. swap projects. Views were configured to display 50 entries per page. 6
  • 7. 3 Paging through With our large data sets, navigation of large tables can become a backlog lists. performance issue. Views were configured to display 50 entries per page. 4 Select and view a story Basic access to a story. from the backlog. 5 Select and view a task. Basic access to a task. 6 Select and view a Basic access to a defect or bug. (Note: JIRA+GreenHopper uses defect/bug. the term bug, while Rally and VersionOne use defect.) 7 Select and view a test. Basic access to a test case. 8 Create an Common management chore. (Note: This had to be manually timed iteration/sprint. for JIRA+GreenHopper, as measured time was about 0.3 seconds while elapsed time was 17 seconds.) 9 Move a story to an Common developer or manager chore. (Note: JIRA+GreenHopper iteration/sprint. and VersionOne use the term sprint, while Rally uses iteration.) 10 Convert a story to a Common developer chore (Note: This operation is not applicable defect/bug. to Rally because of the inherent hierarchy between a story and its defects). 9. Test Results performed badly (subjectively). As such, the leader in a test is given the “Very Good” Each test was performed 1+10 times in rating, which corresponds to five points. The sequence for each software system, and the leading time is then used as a base for mean and standard deviation were computed. comparative scoring of competitors for that The point estimates were then compared to test, with each test score based on how many find the fastest performing application. A +n multiples it was of the fastest performer. The (seconds) indicator was used to indicate the point legend table is illustrated below. relative performance lag of the other applications from the fastest performing Time Multiple Points application for that test. 1.0x ≤ time < 1.5x 5 The test result summary table illustrates 1.5x ≤ time < 2.5x 4 the relative performance for each test to allow 2.5x ≤ time < 3.5x 3 observable comparisons per product and per 3.5x ≤ time < 4.5x 2 test. In order to provide a measurement-based 4.5x ≤ time 1 comparison, a scale was created to allow numerical comparison between products. There were no cases where the leader in a test 7
  • 8. Test Result Summary Table (Relative Performance Analysis) Legend Very Good: (5) Good: (4) Acceptable: (3) Poor: (2) Very Poor: (1) System and Overall 1 2 3 4 5 6 7 8 9 Test Rating Backlog Switch Backlog View View View View Create Story Summary (Out of Refresh Backlog Paging Story Task Defect Test Sprint → 45) Sprint Rally 43          VersionOne 32          JIRA+ GreenHopper 18          It must be noted that resulting means are (symmetrically distributed), and 95% should point-estimate averages. For several reasons, lie within two standard deviations. We we don’t suggest or use confidence intervals or graphically tested for normality using our test for significance. Based on the challenges calibration data and observed our data to be associated with structuring common tests with normally distributed. When there is no overlap different interfaces, different data structures, between timing at two standard deviations, this and no guarantee of connection quality, it is implies it will be fairly rare for one of the extraordinarily difficult to do so. In addition, typically slower performing applications to because each test may have a different weight exceed the performance of the faster or relevance to each customer depending on application (for that particular test). If there is their ALM process, the relevance of a test no overlap at one or two standard deviations leader should be weighted according to the between the lower and upper bounds, the result preference of the reader. That being said, these is marked as “Significant.” If there is overlap tests are intended to reflect the user in one or both cases, that result is flagged as experience. To address some of the concerns “Insignificant.” Significance is assessed associated with point estimates, analysis of between the fastest performing application for high and low bounds based on one and two the test and each of the other two applications. standard deviations was performed. If the high Therefore, the significance analysis is only bound for the fastest test overlaps with the low populated for the application with the fastest bound for either of the slower performing point estimate. The advantage is classed as application tests, the significance of the insignificant if the closest performing peer performance gain between those comparisons implies the result is insignificant. All data is questionable. The overlap suggests there values are in seconds. will be cases where the slower (overlapping) Results from each test are analyzed application may perform faster than the separately below. The results of each test are application with the fastest response time. shown both in table form with values and in Statistical theory and the three-sigma rule bar graph form, and are also interpreted in the suggest that when data is normally distributed, text below the corresponding table. Note that roughly 68% of observations should lie within long bars in the comparison graphs are long one standard deviation of the mean response times, and therefore bad. 8
  • 9. Test 1: Refresh Backlog Page for a Single Project System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 15.27 1.38 +12.13 13.89 – - 12.52 – - GreenHopper 16.64 18.02 Rally 5.53 0.29 +2.39 5.24 – - 4.95 – - 5.81 6.10 VersionOne 3.14 0.25 Fastest 2.88 – Significant 2.63 – Significant 3.39 3.64 Interpretation: The data indicates that for this almost 2.4 seconds. Both VersionOne and particular task, even when accounting for Rally perform significantly better than variance in performance, VersionOne JIRA+GreenHopper when executing this performs fastest. Note that the advantage is operation. relatively small when compared to Rally, though the Rally point estimate does lag by Best Performer: VersionOne 9
  • 10. Test 2: Switch Backlog Views Between Two Projects System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 13.84 0.83 +11.39 13.01 – - 12.19 – - GreenHopper 14.66 15.49 Rally 2.45 0.16 Fastest 2.29 – Significant 2.13 – Significant 2.60 2.76 VersionOne 2.94 0.07 +0.49 2.87 – - 2.79 – - 3.01 3.08 *To perform this operation on JIRA+GreenHopper, the user must navigate between two scrumboards and then load the data. Therefore, the timing numbers for JIRA+GreenHopper are the sum of two measurements. This introduces request overhead not present in the other two tests, yet the disparity suggests more than just simple transaction overhead is the cause of the delay. Furthermore, the resulting page was rendered frozen and was not usable for an additional 10 – 15 seconds. Users would probably pool that additional delay before the page could be accessed in their user experience impression, but it was not included here. Interpretation: The data indicates that Rally user interaction, the experience would be and VersionOne are significantly faster than similar for the two products. JIRA+GreenHopper, even when considering the sum of two operations. Rally is faster than Best Performer: Rally VersionOne, though marginally so. In terms of 10
  • 11. Test 3: Paging Through Backlog List System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 1.53 0.66 Fastest 0.87 – Insignificant 0.21 – Insignificant GreenHopper 2.19 2.85 Rally 1.93 0.11 +0.4 1.81 – - 1.70 – - 2.04 2.15 VersionOne 3.45 0.29 +1.92 3.16 – - 2.87 – - 3.74 4.04 Interpretation: JIRA+GreenHopper had the likely to be comparable. The data indicates fastest point-estimate mean, but the analysis that VersionOne is significantly slower than suggests there is minimal (not significant) the other two systems, and for very large data improvement over Rally, which was the sets like the tests used, this makes scrolling second-fastest. The standard deviations through the data quite tedious. suggest a wider performance variance for JIRA+GreenHopper, and so while the point Best Performer: JIRA+GreenHopper and estimate is better, the overall performance is Rally 11
  • 12. Test 4: Selecting and Viewing a User Story From the Backlog System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 3.49 0.99 +2.95 2.49 – - 1.50 – - GreenHopper 4.48 5.47 Rally 0.53 .07 Fastest 0.46 – Significant 0.40 – Significant 0.60 0.67 VersionOne 1.90 0.30 +1.36 1.59 – - 1.29 – 2.5 - 2.20 Interpretation: The data indicates that Rally is experience. Rally’s performance is also more significantly faster than either consistent than the other two products (i.e., it JIRA+GreenHopper or VersionOne. While the has a much lower response standard result is significant, the one-second difference deviation). between Rally and VersionOne is not likely to have a significant impact on the user Best Performer: Rally 12
  • 13. Test 5: Selecting and Viewing a Task System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 1.36 0.17 +0.92 1.20 – - 1.03 – - GreenHopper 1.53 1.69 Rally 0.44 0.03 Fastest 0.42 – Significant 0.39 – Significant 0.47 0.50 VersionOne 1.46 0.16 +1.01 1.29 – - 1.13 – - 1.62 1.78 Interpretation: The data indicates that Rally is VersionOne showed similar performance. significantly (in the probabilistic sense) faster Overall, the result for all applications was than either JIRA+GreenHopper or VersionOne qualitatively good. by about one second, and also has a more consistent response time (with the lowest Best Performer: Rally standard deviation). JIRA+GreenHopper and 13
  • 14. Test 6: Selecting and Viewing a Test Case System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 1.91 0.86 +1.37 1.05 – - 0.19 – - GreenHopper 2.77 3.64 Rally 0.54 0.13 Fastest 0.41 – Significant 0.28 – Insignificant 0.67 0.80 VersionOne 1.45 0.18 +0.91 1.27 – - 1.09 – - 1.62 1.80 Interpretation: The data indicates that, again, suggesting a consistently better experience. Rally is fastest in this task, though the speed VersionOne was second in terms of differences are significant at the one standard performance, followed by deviation level where there is no overlap in JIRA+GreenHopper. their respective timing ranges, but not at two standard deviations. Rally performed with the Best Performer: Rally lowest point estimate and the lowest variance, 14
  • 15. Test 7: Selecting and Viewing a Defect/Bug System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 1.70 0.81 +1.02 0.88 – - 0.07 – - GreenHopper 2.51 3.32 Rally 0.68 0.05 Fastest 0.63 – Significant 0.58 – Insignificant 0.72 0.77 VersionOne 1.74 0.17 +1.06 1.56 – - 1.39 – - 1.91 2.08 Interpretation: The data indicates that Rally is very low standard deviation. Though the point faster by roughly one second based on the estimates are very close, the performance of point-estimate mean when compared to the VersionOne is preferred based on the low other two products, with the difference being standard deviation. That being said, given that significant at the one standard deviation level the point estimates are all below two seconds, but not at two standard deviations. Variance in there would be little to no perceptible the results of the other products suggests they difference between VersionOne and will perform similarly to Rally on some JIRA+GreenHopper from a user perspective. occasions, but not all. Rally’s performance was relatively consistent, as indicated by the Best Performer: Rally 15
  • 16. Test 8: Add an Iteration/Sprint System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 17.76 0.60 +17.72 17.16 – - 16.56 – - GreenHopper 18.36 18.96 Rally 0.04 0.00 Fastest 0.04 – Significant 0.03 – Significant 0.05 0.05 VersionOne 1.36 0.10 +1.32 1.25 – - 1.15 – - 1.46 1.57 *Due to the disparity between Rally and JIRA+GreenHopper here, the graph appears to show no data for Rally. The graph resolution is simply insufficient to render the data clearly, given the large value generated by JIRA+GreenHopper tests. **The JIRA+GreenHopper data was manually measured due to inconsistencies in timing versus content rendering. Based on requests, it appeared asynchronous page timings were completing when requests were submitted, and the eventual content updates and rendering were disconnected from the original request being tracked. While this increases the measurement error, it certainly would not account for a roughly 17-second disparity. Interpretation: Rally is the fastest performer JIRA+GreenHopper is many times slower than in this test, with the results being significant at both Rally and VersionOne. both the one and two standard deviation levels. Best Performer: Rally 16
  • 17. Test 9: Move a Story to an Iteration/Sprint System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 9.80 6.88 +8.42 2.91 – - 0.00* – - GreenHopper 16.68 23.56 Rally 3.37 0.22 +1.99 3.15 – - 2.94 – - 3.59 3.80 VersionOne 1.38 0.36 Fastest 1.02 – Significant 0.66 – Insignificant 1.74 2.09 *The standard deviation range suggested a negative value, which is, of course, impossible. Therefore, 0.00 is provided. Interpretation: The data indicates that test is a result of the enormous standard VersionOne is fastest for this operation. The deviation of the JIRA+GreenHopper tests. insignificant two standard deviation overlap Best Performer: VersionOne 17
  • 18. Test 10: Convert a Story to a Defect/Bug System Mean Standard Point 1 SD 1 SD 2 SD 2 SD Request Deviation Estimate Range Overlap Range Overlap Time (seconds) Comparison (seconds) Analysis (seconds) Analysis (seconds) (seconds) JIRA+ 26.56 2.94 +24.87 23.62 – - 20.68 – - GreenHopper 29.50 32.44 Rally 1.69 0.25 Fastest 1.44 – Significant 1.19 – Significant 1.94 2.19 VersionOne 6.06 0.28 +4.36 5.77 – - 5.49 – - 6.34 6.62 *JIRA+GreenHopper required manual timing. See the interpretation below for explanation. Interpretation: This operation is an example update for about 10 seconds while it updated of one in which the procedure in each system the icon to the left of the new defect from a is completely different and perhaps not green story icon to a red defect icon. This extra comparable in any reasonable way. In 10 seconds was not included in the timing JIRA+GreenHopper, there are three operations results, although perhaps it should have been. involved (access the story, invoke the editor, In Rally, defects are hierarchically below and after changing the type of issue, saving the stories as one of a story’s attributes, and so a changes and updating the database) and these story cannot be converted to a defect, though had to be manually timed. In addition, the defects can be promoted to stories. That is JIRA+GreenHopper page froze after the what we measured for Rally’s case. And 18
  • 19. finally, VersionOne has a menu option to do scrumboard, which JIRA+GreenHopper this task. The results, reported here just for implements with the plug-in GreenHopper. interest and not defensible statistically, The GreenHopper overlay/add-on seemed indicate that Rally is fastest at this class of unable to handle the large data sets effectively. operation, followed by VersionOne at plus- When we tried to include the test of viewing four seconds and JIRA+GreenHopper at +24 the backlog for all projects, we were able to do seconds. so for Rally and VersionOne, but the JIRA+GreenHopper instance queried for over Best Performer: N/A – Informational 12 hours without rendering the scrumboard observations only. and merged project backlog. Some object view operations resulted in second-best performance for JIRA+GreenHopper, but with the 10. Conclusions exception of viewing tasks, the variance associated with request was extraordinarily Our testing was by no means exhaustive, but high compared to Rally and VersionOne. The thorough enough to build a reasonably sized large variance will manifest to users as an result set to enable comparison between inconsistent experience (in terms of response applications. It fundamentally aimed to assess time) when performing the same operation. the performance of testable elements that are Anecdotally, the performance of consistent between applications. We tried to VersionOne compared to Rally was choose simple, small tests that mapped well significantly degraded when import activity between the three systems and could be was taking place, to the extent that measured programmatically as opposed to VersionOne becomes effectively unusable manually (and succeeded in most cases, during import operations. Further testing could though some manual timing was required). be performed to identify whether this is a Rally was the strongest performer based on CSV-limited import issue or if it extends to the test results, leading outright in six of the programmatic API access, as well. Given how nine that were compared. In one of these six many platforms utilize API access regularly, it tests, Rally tied with VersionOne from a would be interesting to explore this result scoring perspective in terms of relative further. performance (using the scoring system Both Rally and VersionOne appear to developed for comparisons), though it led provide a reasonable user experience that from a raw measured-speed perspective. In should satisfy customers in most cases when one test not included in the six, Rally tied with the applications are utilizing large data sets JIRA+GreenHopper from a numeric with over 500,000 artifacts. perspective and within the bounds of the JIRA+GreenHopper is significantly scoring model that was established. disadvantaged from a performance VersionOne was the strongest performer in perspective, and seems less suitable for two of the nine tests, and exhibited very customers with large artifact counts or with similar performance characteristics (generally aggressive growth expectations. Factors such within a 1 – 12 second margin) in many of the as user concurrency, variations in sprint tests that Rally led. JIRA+GreenHopper did structure, and numerous others have the not lead any tests, but as noted, tied with Rally potential to skew results in either direction, for one. and it is difficult to predict how specific use With the exception of backlog paging, cases may affect performance. 
These tests do, JIRA+GreenHopper trailed in tests that however, provide a reasonable comparative leveraged agile development tools such as the 19
  • 20. baseline, suggesting Rally has a slight performance advantage in general, followed closely by VersionOne. References A variety of references were used to help build and execute a performance testing methodology that would allow a reasonable, statistically supported comparison of the performance of the three ALM systems. In addition to documentation available at the websites for each product, the following resources were used: “Agile software development.” Wikipedia. Accessed Sept. 28, 2012 from http://en.wikipedia.org/wiki/Agile_soft ware_development. Beedle, Mike, et al. “Manifesto for Agile Software Development.” Accessed Sept. 28, 2012 from http://agilemanifesto.org. Hewitt, Joe, et al. Firebug: Add-ons for Firefox. Mozilla. Accessed Sept. 28, 2012 from http://addons.mozilla.org/en- us/firefox/addon/firebug. Honza. “Firebug Net Panel Timings.” Software is Hard. Accessed Sept. 28, 2012 from http://www.softwareishard.com/blog/fi rebug/firebug-net-panel-timings. Peter. “Top Agile and Scrum Tools – Which One Is Best?” Agile Scout. Accessed Sept. 28, 2012 from http://agilescout.com/best-agile-scrum- tools. 20