Tushar Mahapatra - Portfolio for recent Projects
Career Portfolio
Name: Tushar Mahapatra
Email: tushar_mahapatra@yahoo.com
As of: August 2010
1. Contents
1. Contents....................................................................................................................................2
2. Introduction...............................................................................................................................3
3. 09/2009 – present: Weather Data ETL.....................................................................................4
3.1 Analysis.................................................................................................................................4
3.2 Design...................................................................................................................................5
3.3 Pentaho Data Integration..................................................................................................5
3.3.1 ‘FTP & ingest METAR files’ solution............................................................................5
3.3.2 ‘Ingest local METAR files’ solution.............................................................................8
4. 09/2009 – 03/2010: SharePoint Training.................................................................................10
4.1 SharePoint Student Project...............................................................................................10
4.2 SharePoint Team Project..................................................................................................11
5. 2008 – 2009: Modeling with XSD, AIXM5, GML, UML............................................................16
6. 2004: Job Queue Framework and Data Extractors.............................................................20
6.1 Design.................................................................................................................................20
6.2 Implementation.................................................................................................................22
Page 2 of 25
2. Introduction
This portfolio describes aspects of selected projects I have worked on recently. The projects
covered span the period from 2004 to the present (08/2010). Except for the SharePoint
training projects, which were completed as part of a SetFocus training program, all the other
projects were executed in my role as a consultant with the Federal Aviation Administration.
3. 09/2009 – present: Weather Data ETL
3.1 Analysis
Since September 2009, I have been working on the development of a solution for
capturing weather data and saving it to a database. The long-term requirement is
to support a variety of weather data formats, which consist of complex and
archaic codes. So far, I have been working with the METAR (meteorological report)
and TAF (terminal aerodrome forecast) formats, which are similar to some extent.
Available documentation describing the formats is not sufficiently detailed. Many
variations of these formats exist. I analyzed the specific file formats I was provided with
and described the syntax of these formats in a document using the railroad diagram
construct. The specification is integrated, i.e. both the METAR and the TAF formats were
described using one specification, and common elements were specified only once. The
following picture shows a part of the document.
The texts with grey shading are hyperlinks, which make it easier to drill down into the specification.
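To give a flavor of the kind of syntax the railroad diagrams capture, the following is a small illustrative sketch, not the project's actual parsing code: it recognizes only a tiny fragment of a METAR report (station, issue time, and wind group) with a regular expression. The real specification covers many more groups and variations.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a tiny fragment of the METAR syntax (station, issue
// time, and wind group), nowhere near the full specification described in
// the railroad-diagram document.
public class MetarSketch {
    // e.g. "KJFK 051251Z 18004KT ..." -> station KJFK, day 05, time 12:51Z,
    // wind from 180 degrees at 4 knots.
    private static final Pattern HEAD = Pattern.compile(
        "^(?<station>[A-Z]{4}) (?<day>\\d{2})(?<hhmm>\\d{4})Z " +
        "(?<dir>\\d{3}|VRB)(?<speed>\\d{2,3})(G\\d{2,3})?KT");

    public static String station(String report) {
        Matcher m = HEAD.matcher(report);
        if (!m.find()) throw new IllegalArgumentException("unparseable: " + report);
        return m.group("station");
    }

    public static int windSpeedKt(String report) {
        Matcher m = HEAD.matcher(report);
        if (!m.find()) throw new IllegalArgumentException("unparseable: " + report);
        return Integer.parseInt(m.group("speed"));
    }

    public static void main(String[] args) {
        String metar = "KJFK 051251Z 18004KT 10SM FEW250 23/18 A3011";
        System.out.println(station(metar) + ", wind " + windSpeedKt(metar) + " kt");
    }
}
```

A regex is sufficient for this fragment; the full grammar, with its optional and repeating groups, is what the railroad diagrams and the dedicated parser handle.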
3.2 Design
Based on my findings in the analysis phase, I designed the class diagram shown below to
help me implement the parsing code. The implementation was in Java using the
Eclipse IDE. I used a plug-in that facilitated model-driven architecture: whenever I
modified the UML class diagram, the corresponding Java code was generated or
updated, and conversely, when I changed the Java code, the model was
automatically updated. Both the METAR and the TAF reports are modeled in an
integrated manner.
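The integration idea can be sketched as follows. The class names below are hypothetical stand-ins, not the ones in the actual class diagram: elements common to METAR and TAF live in a shared base class, so each common element is modeled only once.

```java
// Hypothetical sketch of the integrated model: common elements (station,
// wind, ...) are specified once in a shared base class, with METAR- and
// TAF-specific fields in the subclasses. These names are illustrative,
// not the actual model's.
public class ModelSketch {
    public static abstract class WeatherReport {   // common elements, once
        public String stationId;
        public Integer windDirectionDeg, windSpeedKt;
        public abstract String reportType();
    }
    public static class Metar extends WeatherReport {  // observation-specific
        public Integer temperatureC, dewPointC;
        public String reportType() { return "METAR"; }
    }
    public static class Taf extends WeatherReport {    // forecast-specific
        public String validityPeriod;                  // e.g. "0512/0618"
        public String reportType() { return "TAF"; }
    }

    public static void main(String[] args) {
        Metar m = new Metar();
        m.stationId = "KJFK";
        Taf t = new Taf();
        t.stationId = "KJFK";
        System.out.println(m.reportType() + " and " + t.reportType()
            + " share the fields of WeatherReport");
    }
}
```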
3.3 Pentaho Data Integration
To take advantage of open-source software, I researched the available open-source
ETL tools. My comparative analysis steered me towards selecting the Pentaho Data
Integration (PDI) product (formerly called Kettle, and a part of the Pentaho BI Project
suite of business intelligence tools).
To integrate my parsing code with PDI, I developed a ‘METAR Input’ step plug-in. PDI
solutions are packaged as ‘transformations’ and ‘jobs’. ‘Transformations’ contain the
actual ETL functionality, whereas ‘jobs’ help in gluing together other jobs and
transformations with job steps which contain non-ETL functionality. Transformations and
jobs are edited and assembled diagrammatically using a PDI editing tool called ‘Spoon’.
The diagrams are themselves units of executable code, so all PDI code is effectively
self-documenting.
I developed two solutions to implement the ETL of METAR data: one is for downloading
and ingesting METAR files, and the other is for ingesting METAR files already downloaded.
3.3.1 ‘FTP & ingest METAR files’ solution
The ‘FTP & ingest METAR files’ solution is designed to be scheduled to run at the top of
each hour. On each run, 24 METAR files (one for each hour of a day) are downloaded
from NOAA’s NWS website, and then ingested into the database. The following pictures
illustrate parts of my solution. The first picture shows the top-level job, named
‘FTP_ingest_METAR_files’. The current UTC hour is determined and then used to drive
certain actions; for example, log and data files are compressed for archival twice a
day. A list of the names of the 24 METAR files, ordered in a particular sequence, is
created. That list is then used to drive the FTP download and ingestion of the METAR
files.
The following picture shows the ‘Set_session_constants’ transformation opened in Spoon.
The ‘Set fields to session constants’ step is an instance of the ‘Javascript Values’ step
where Javascript is used to determine the current (UTC) hour and today’s and
yesterday’s dates. The configuration dialog for the step is shown opened.
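The logic of that step amounts to the following, sketched here in Java rather than the step's actual JavaScript: determine the current UTC hour plus today's and yesterday's dates, for later steps to use as constants.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

// Java equivalent of the 'Set fields to session constants' logic (the real
// step is JavaScript inside PDI): the current UTC hour, today's date, and
// yesterday's date.
public class SessionConstantsSketch {
    // Returns { "HH", today as yyyyMMdd, yesterday as yyyyMMdd } for the
    // given UTC instant.
    public static String[] sessionConstants(int year, int month, int dayOfMonth,
                                            int utcHour) {
        Calendar now = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        now.clear();
        now.set(year, month - 1, dayOfMonth, utcHour, 0, 0);
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        String today = day.format(now.getTime());
        Calendar y = (Calendar) now.clone();
        y.add(Calendar.DAY_OF_MONTH, -1);              // yesterday
        String yesterday = day.format(y.getTime());
        return new String[] { String.format("%02d", utcHour), today, yesterday };
    }

    public static void main(String[] args) {
        Calendar now = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        String[] c = sessionConstants(now.get(Calendar.YEAR),
                now.get(Calendar.MONTH) + 1, now.get(Calendar.DAY_OF_MONTH),
                now.get(Calendar.HOUR_OF_DAY));
        System.out.println("UTC hour=" + c[0] + " today=" + c[1]
                + " yesterday=" + c[2]);
    }
}
```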
The ‘Generate_METAR_filenames’ transformation is shown opened in Spoon below. The
configuration dialog for the ‘Set METAR_FILE_NAME of each row’ step is shown opened.
This step is also an instance of the ‘Javascript Values’ step. The Javascript code shows
how the names of the METAR files to be processed are generated. The order of the files is
from least recent to most recent.
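The ordering logic can be sketched as below. The ‘HHZ.TXT’ naming is an assumption based on NOAA/NWS hourly METAR cycle files; the actual names produced by the JavaScript step are not reproduced here. Given the current UTC hour, the least recent file is the one for the following hour (its cycle is now about 23 hours old) and the most recent is the one for the current hour.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the filename-generation logic. "HHZ.TXT" cycle-file names are
// an assumption, not taken from the actual step. The 24 hourly files are
// ordered from least recent to most recent.
public class MetarFilenamesSketch {
    public static List<String> filenames(int currentUtcHour) {
        List<String> names = new ArrayList<String>();
        for (int i = 1; i <= 24; i++) {
            int hour = (currentUtcHour + i) % 24;   // oldest cycle first
            names.add(String.format("%02dZ.TXT", hour));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(filenames(12)); // 13Z.TXT first, 12Z.TXT last
    }
}
```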
On the left, the ‘FTP_ingest_METAR_file’
job is shown opened in the Spoon editor.
The configuration dialog for the ‘FTP
METAR file’ job step is also shown opened.
This is an instance of the ‘Get a file with
FTP’ job step. It shows how a field which
was previously set to a METAR file name is
being used.
The ‘Ingest_METAR_file’ transformation is shown opened below. The ‘METAR Input’ step
plugin whose development was discussed above is shown in the toolbox. An instance of
the step is shown being used in the transformation. The transformation parses the METAR
file using the new plugin and then inserts the data in the database. Data for three child
tables is denormalized before insertion.
The configuration dialog for the ‘Read & parse METAR’ transformation, a part of the new
plugin, is shown opened below.
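The denormalization mentioned above can be illustrated generically. This is a hypothetical example, since the actual table layouts are not shown in this document: child rows (here, a report's cloud layers) are folded into fixed columns on the parent row before insertion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of denormalizing a one-to-many relationship
// before insertion: each report's child rows become fixed columns on the
// parent row, padded with nulls so every row has the same width. The real
// transformation denormalizes three child tables whose layouts are not
// reproduced here.
public class DenormalizeSketch {
    public static List<String> denormalize(String station, List<String> cloudLayers,
                                           int maxLayers) {
        List<String> row = new ArrayList<String>();
        row.add(station);
        for (int i = 0; i < maxLayers; i++) {
            row.add(i < cloudLayers.size() ? cloudLayers.get(i) : null);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> layers = Arrays.asList("FEW250", "SCT100");
        System.out.println(denormalize("KJFK", layers, 4));
        // -> [KJFK, FEW250, SCT100, null, null]
    }
}
```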
3.3.2 ‘Ingest local METAR files’ solution
This solution ingests a folder tree containing METAR files and loads the data into a
database. The ‘Ingest local METAR files’ job is shown below. First, a list of all METAR files in
all subfolders is created. That list is then used to determine what folders need to be
processed. For each folder in the list, the ‘Ingest METAR folder’ job is called.
The ‘Determine METAR folders’ transformation is shown below.
The ‘Ingest METAR folder’ job is shown below. A list of the METAR files in the folder is first
made. For each file, the ‘Ingest METAR file’ job is called. After ingestion of all the METAR
files in the folder is done, the METAR and log files are zipped and the folder is deleted if it
has no other files.
The ‘Get METAR filenames’ transformation is shown below.
The ‘Ingest METAR file’ job is shown below. The ‘Ingest METAR file’ transformation invoked
by the job is the same one used in the earlier solution discussed in the previous section.
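The folder-driven flow can be sketched as follows: collect all METAR files under a root folder, derive the distinct folders that need processing, then handle each folder's files in turn. This is a simplified sketch; the zipping of METAR and log files and the deletion of emptied folders are omitted, and the ‘.TXT’ extension is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the 'Ingest local METAR files' flow. The zipping
// and folder-deletion steps of the real job are omitted, and the ".TXT"
// extension is an assumption.
public class LocalIngestSketch {
    // Recursively collect METAR files under a root folder.
    public static List<File> findMetarFiles(File root) {
        List<File> files = new ArrayList<File>();
        File[] children = root.listFiles();
        if (children == null) return files;
        for (File child : children) {
            if (child.isDirectory()) files.addAll(findMetarFiles(child));
            else if (child.getName().endsWith(".TXT")) files.add(child);
        }
        return files;
    }

    // The distinct parent folders to process, in first-seen order.
    public static Set<File> metarFolders(List<File> metarFiles) {
        Set<File> folders = new LinkedHashSet<File>();
        for (File f : metarFiles) folders.add(f.getParentFile());
        return folders;
    }

    public static void main(String[] args) {
        for (File folder : metarFolders(findMetarFiles(new File(".")))) {
            System.out.println("would ingest folder: " + folder);
        }
    }
}
```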
4. 09/2009 – 03/2010: SharePoint Training
Between September 2009 and March 2010, I was enrolled in the SharePoint Training track
of SetFocus’ Master’s Program. This track is an intensive SharePoint training experience
designed to prepare students for development opportunities with Microsoft’s SharePoint
2007 product.
As part of the training, students were expected to complete two projects simulating
real-world projects. The first project was a student project which each student completed
alone. The second project was a team project where all students collaborated in the
completion of the project.
4.1 SharePoint Student Project
The student project was for a fictitious towing company called Acme. We had to design
and establish a SharePoint Solution Management Portal to help manage all of the
SharePoint solutions created for the company. This portal in the company intranet was
required to be created for the company’s SharePoint developers to organize and
manage their solution projects. The following picture is a screenshot of the portal I
developed.
The top section titled ‘Create Solution Sites’ is an instantiation of a
‘CreateSolutionSiteWebPart’ web part I developed. It accepts site creation data from
the user and uses it to create a web site. On the top of the web page, there is a
collection of tabs of which some have names of the form ‘Test Site ##’. These are sites
created by this web part. Below this web part in the section titled ‘Solution Management’
is an instantiation of another web part I created named ‘SolutionManagementWebPart’.
This web part used SharePoint’s SPGridView and SPDataSource controls to display a grid-
view of all solution items in the ‘Solution List’ SharePoint list. After this web part is a
SharePoint list created from a ‘Change Management List Definition’ I developed. Next is
an instance of the Content Query Web Part I developed to find and display all Solution
items in the site collection.
4.2 SharePoint Team Project
The team project was also for a fictitious construction company called Acme. We had to
design and establish a SharePoint application to support the company’s towing
providers. We were expected to perform the following:
• Create a SharePoint Application with internal as well as extranet visibility
• Develop an InfoPath document library
• Develop custom workflows
• Implement Content Management
I developed the following InfoPath form which was meant for the user to submit
purchase order data to a form library.
Below is another InfoPath form I developed similarly for the invoice form library.
I developed an ‘Invoice’ ECB menu item for the purchase order list. The menu item is
shown open below. Selecting the ‘Invoice’ option led to the display of an ASP.NET
application page (shown next) where the user could review purchase order data
retrieved from the purchase orders list, enter invoice data, and submit it to the invoice
form library.
The invoice list below shows invoice list items created by the process described above.
5. 2008 – 2009: Modeling with XSD, AIXM5, GML, UML
The TFR Automation team in the FAA is responsible for managing the enhancement and
support of the TFR Automation system, which facilitates the tracking of TFRs
(Temporary Flight Restrictions). The system’s web interface is heavily used and is
consulted by pilots prior to undertaking flying missions. As a member of this team, I
implemented compliance with the AIXM5 standard for data interchange. AIXM is an XML
standard for aeronautical information; it is an extension of another XML standard,
GML, which supports the exchange of geographical information.
For the implementation, I had to develop extensions to the AIXM5 standard. To do this, I
first had to understand the requirements of the TFR Automation system in detail. Next, the
requirements had to be modeled in UML using AIXM5 and GML constructs. The UML
model was then converted by scripts into XML schemas (XSD). These schemas were then
used to develop serialization/de-serialization code to read and write XML documents
complying with the schemas. This code was then invoked by the TFR Automation code
for reading and writing AIXM5-compliant documents.
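As a minimal illustration of such a serialization/de-serialization layer, the sketch below writes and reads a TFR-like XML document using the JDK's DOM API. The element name and attribute are hypothetical stand-ins; the real code produced AIXM5/GML documents conforming to the generated schemas.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Minimal write/read round trip with the JDK DOM API. The element and
// attribute names are hypothetical; the project's real documents were
// AIXM5/GML-conformant.
public class TfrXmlSketch {
    public static String write(String tfrId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("TemporaryFlightRestriction");
            root.setAttribute("id", tfrId);
            doc.appendChild(root);
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static String readId(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("id");
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        String xml = write("TFR-2010-001");
        System.out.println(xml);
        System.out.println("round-tripped id: " + readId(xml));
    }
}
```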
The screenshot below shows the UML model developed in Rational Rose for the project.
This UML model was converted by scripts to the following schema shown in pictorial
format in the XMLSpy tool.
The text for this XML schema is shown below.
An XML document instance of this schema is shown below with the specification of the
location of the schema highlighted in yellow.
A screenshot showing web pages at the TFR web site is shown below. The red circle
encircles a button labeled ‘AIXM5’ which links to the XML document shown earlier
containing the TFR information.
6. 2004: Job Queue Framework and Data Extractors
A common requirement in the Airspace Information Management (AIM) laboratory of
the FAA was the extraction of data from various data stores. Typically, these extractions
took significant amounts of time to complete, and it was not reasonable to expect the
user to wait for their completion once they had been initiated. In
response, I designed and implemented a Job Queue framework which was used as the
basis for various data extraction systems in the lab. The framework was developed
using .NET 1.1, C#, VB.NET, Oracle ODP.NET and ASP.NET. I also developed a couple of
data extractors (‘Offload Extractor’ for traffic data, ‘Obstacles Extractor’ for obstacles
data). Other data extractors were developed using this framework by other developers.
6.1 Design
Some diagrams from the design document which I authored are shown below. The first
diagram is an ‘Architectural Overview’ diagram.
[Architectural Overview diagram: the Offload Extractor Web Application and the Offload Extractor Windows Service are built on the Offload component, the SDAT Traffic, ArcView, and Intergraph document components, and the JobQueue component; the JobQueue Web Application and JobQueue Windows Service manage jobs; data resides in the Offload and JobQueue databases.]
The ‘Deployment Diagram’ is shown below.
[Deployment Diagram: the Offload Extractor Web Application Server hosts the OffloadExtractorWebApplication library, which calls the OffloadExtractorWindowsService executable (using the Offload and JobQueue libraries) on the Offload Extractor Server; the JobQueue Web Application Server hosts the JobQueueWebApplication library, and the JobQueue Server hosts the JobQueueWindowsService executable (using the JobQueue library); the services access the Offload and JobQueue databases.]
The object model for the Offload Extractor system is shown below.
[Object model: the web page classes DefaultPage, SimpleQueryPage, AdvancedQueryPage, WaitPage, JobQueuePage, and JobDetailsPage use the OffloadExtractorService and JobQueueService classes. OffloadExtractor, with Extract(), OpenExtract(), SaveFlight(), and CloseExtract() operations, accesses OffloadDB (FirstMessageDateTime, LastMessageDateTime, GetFlights()) and is specialized by OffloadToSDATTrafficExtractor, OffloadToArcViewExtractor, and OffloadToIntergraphExtractor, which generate SDATTrafficDocument, ArcViewDocument, and IntergraphDocument respectively, each with Open(), Close(), Save(), AddMetadata(), and AddFlight() operations. SDATTrafficMerger provides MergeFlights(), and JobQueueDB exposes AddJob(), GetJob(), ModifyJob(), and DeleteJob().]
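The object model above follows the Template Method pattern: the base extractor fixes the Extract() sequence, while each output format overrides the open/save/close steps. Below is a sketch of that pattern, in Java rather than the original C#, with the flight data and document writing stubbed out.

```java
// Template Method sketch of the extractor design: the base class fixes
// the extract() sequence; each output format (here, only the ArcView one)
// overrides the open/save/close steps. Java stand-in for the original C#.
public class ExtractorSketch {
    public static abstract class OffloadExtractor {
        protected final StringBuilder log = new StringBuilder();

        public final String extract(String[] flights) {  // fixed sequence
            openExtract();
            for (String flight : flights) saveFlight(flight);
            closeExtract();
            return log.toString();
        }
        protected abstract void openExtract();
        protected abstract void saveFlight(String flight);
        protected abstract void closeExtract();
    }

    // One of the three format-specific extractors from the object model.
    public static class OffloadToArcViewExtractor extends OffloadExtractor {
        protected void openExtract()  { log.append("open ArcView document; "); }
        protected void saveFlight(String f) { log.append("save ").append(f).append("; "); }
        protected void closeExtract() { log.append("close"); }
    }

    public static void main(String[] args) {
        OffloadExtractor e = new OffloadToArcViewExtractor();
        System.out.println(e.extract(new String[] { "AAL123", "UAL456" }));
    }
}
```

This is why other developers could add new extractors on the framework: each one only supplies the three format-specific steps.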
6.2 Implementation
Some screenshots of actual web pages currently in production for the Offload Extractor
system are shown below.
The first screenshot is that of the main menu page of the Offload Extractor.
The screenshot below is for the web page which shows the Offload Extractor job queue.
The web page showing details for an Offload Extractor job is shown below.