1. Built around answering questions:
Interesting pattern.
What does it mean?
Singapore to Taiwan
via LA?
Why so slow?
2. EPOC and NetSage
for 4NRP
Jennifer Schopf, TACC / UT Austin
Jason Zurawski, ESnet
EPOC/NetSage is supported by NSF award #1826994
3. Today’s Discussion
• 10-15 minutes about EPOC (and maybe a fun use case there)
• Overview of NetSage and its architecture
• NetSage Basics using the CENIC NetSage Portal
• http://cenic.netsage.global
• Use Cases Walk Through
• Open Discussion – what do you want to know?
4. Why an Engagement Operations Center?
• Today’s science is collaborative science
• Collaborative science
• Multiple partners
• Multiple data sets
• Many points of connection
• Cross agency cooperation
• With better access to data we ask harder questions
• Interactive data sources change the science we do
6. Engagement and Performance Operations Center (EPOC)
• Now a joint project between Texas Advanced Computing Center (TACC) and ESnet
• Part of CC* program for domestic science support
• Partnerships with regional, infrastructure, and science
communities that span the NSF and DOE continuum of
funding
• Focus on Smallest Difference for the Biggest Change
7. Core Mission
• Understanding and supporting science use cases
• “Smallest difference for the biggest change”
• Campus, regional, national, and international support
• Debugging any and all network complications via established
measurement and monitoring infrastructure
• Data mobility at all layers of the ecosystem:
• Software, hardware, and network
• Work with anyone, focusing mostly on those who aren’t affiliated
with an R1 or a large NSF center
8. Current Regional Partners (13)
• Front Range GigaPop (FRGP)
• Great Plains Network (GPN)
• iLight
• KINBER
• Lonestar Education and Research Network (LEARN)
• NJEdge
• NOAA N-wave
• NYSERNet
• Ohio Academic Resources Network (OARnet)
• Pacific Northwest GigaPop (PNWGP)
• Southern Crossroads (SoX)
• Sun Corridor Network (SCN)
• Texas Advanced Computing Center (TACC)
9. EPOC’s Five Current Focus Areas
1. Roadside Assistance and Consulting
2. Application Deep Dives
3. Network Analysis (NetSage)
4. Data Mobility Exhibition/Baseline Testing
5. Training
10. Roadside Assistance
• “This file transfer worked last week, but it doesn’t
anymore?”
• Think of this like a flat tire or crash repair
• Anyone can submit
• Contact epoc@tacc.utexas.edu
• Requests are triaged within 24 hours
• Some initial investigation to verify the issues
• A Case Manager and Lead Engineer are assigned
• Shareable infrastructure set up
• Centralization of Researcher Assistance
11. Roadside Assistance - Consulting
• EPOC is an “Ask Me Anything” help desk
• Often simpler questions:
• Suggestions for data architecture choices
• DTNs, DMZs, firewalls
• Data projections for science fields
• Expected (real) performance between two sites
• Advice on how to conduct a performance assessment
• Or others!
• Same operations center approach, aim for 1 business day
turnaround for first response
• 287 RA/C cases to date
12. Roadside Assistance is not “normal network engineering problem solving”
• We don’t own any of the resources having problems
• We coordinate with the resource owners and the other
networks/systems people
• We can’t always run the tests ourselves
• Must be a collaborative effort
• Best technical choice often isn’t an option
• We try to make the smallest change for the biggest
difference within the constraints we’re handed
13. Soft Failures are Different from Hard Failures
• Many problems are separated in time
• 2 weeks to 2 months or more?
• Many of the problems aren’t just on/off
• Soft failures or decreased performance
• Start of problem almost never clear
• End goal often isn’t clear either
14. RA Focus: Routing Issues
● End-to-end performance debugging often shows a
routing issue
● Asymmetric Routes
● Commercial/Commodity paths chosen over R&E
● Smaller pipe chosen due to stale routing configuration
● BGP routing not normalizing to best path after outage
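One way to picture the asymmetric-route check is to compare the sequence of networks seen in each direction. The sketch below is a minimal illustration, not EPOC's tooling: the hop names are hypothetical labels of the kind you might derive by hand from traceroute output gathered at both ends.

```python
def is_asymmetric(forward_hops, reverse_hops):
    """Return True when the reverse path is not simply the forward path reversed."""
    return list(forward_hops) != list(reversed(reverse_hops))


def only_in_one_direction(forward_hops, reverse_hops):
    """Networks that appear in exactly one of the two directions."""
    return sorted(set(forward_hops) ^ set(reverse_hops))


# Hypothetical network labels for each hop (assumed data, not from a real trace).
forward = ["campus", "regional-re", "national-re", "remote-site"]
reverse = ["remote-site", "commodity-isp", "regional-re", "campus"]

print(is_asymmetric(forward, reverse))
print(only_in_one_direction(forward, reverse))
```

Comparing at the network level rather than per router interface keeps the check readable; a commodity network showing up in only one direction is the kind of pattern the bullets above describe.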
15. Routing Working Group
● Joint Working Group between GNA-G and APAN
○ Led by Warrick Mitchell (AARNet), and Brenna Meade and Hans Addleman (IU)
○ ~170 members
○ Monthly meetings with discussions of ongoing routing
cases and occasional tool talks
● https://www.gna-g.net/join-working-group/gna-g-routing-wg/
● List: routing-wg@lists.gna-g.net
○ Contact meadeb@iu.edu to join
16. RA Focus: MTU settings impacting performance
● MTU mismatches between networks AND internal to networks
● Non-standard MTU changes made or required by commercial DDoS scrubbing services
● Path MTU Discovery blocked by security appliances and ACLs
● Quick guide to explain and help fix: https://epoc.global/wp-content/uploads/About-MTUs.pdf
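To make the mismatch idea concrete: the largest packet that can cross a path unfragmented is limited by the smallest MTU along it. The sketch below assumes you already know each hop's MTU; the hop names and values are hypothetical, and in practice this information is often what is hard to obtain.

```python
def path_mtu(hop_mtus):
    """The usable MTU of a path is the smallest MTU of any hop on it."""
    return min(hop_mtus.values())


def hops_below(hop_mtus, endpoint_mtu):
    """Hops whose MTU is smaller than what the sending host is configured for."""
    return [name for name, mtu in hop_mtus.items() if mtu < endpoint_mtu]


# Hypothetical path: jumbo frames on campus and regional, 1500 at a scrubbing service.
hops = {"campus-core": 9000, "ddos-scrubber": 1500, "regional": 9000}

print(path_mtu(hops))
print(hops_below(hops, endpoint_mtu=9000))
```

If Path MTU Discovery messages are filtered, a sender configured for 9000 bytes may never learn about the 1500-byte hop, which is one reason these cases tend to show up as soft failures rather than outages.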
17. RA Current Lessons Learned
• Soft failures are hard (and can come back)
• Smallest change for the biggest difference
• This is NEVER the optimal change for the BEST
outcome
• Socio-political issues are always in play
• We try to document common problems to help as
many others as we can
• Huge need in the community for this type of work
18. EPOC Deep Dive Vision
• Think of this as regular maintenance, an oil change, or planning to buy a car
• Based on ESnet facility requirements reviews
• Walk through the science workflow with the actual scientists
• A way to understand needs and planning
• Often identifies issues that have nothing to do with networks, and everything to do with sociology
19. Deep Dive Overview
• Formal mechanism via structured conversations to determine
shared understanding of CI needs
• Bring together a cross section of campus
• Network users (researchers)
• Administrators
• Technology providers
• Try to find common problems and paths forward
• In-person component a significant value add
• Eighteen Deep Dive reports available at
https://epoc.global/materials
20. Deep Dive: Face-to-Face Discussion
• Bring together researchers, IT staff, research admin
• Create a shared vision to go forward
• Share information for strategic programs, initiatives
• Guide organizational strategy
• Build relationships with constituents
• Identify and resolve network-related issues, existing or
anticipated
21. We Walk Through Scientific Components…
1. Background information
• Brief overview of the facility, nature of the science being
performed
2. Collaborators
• Identify people and institutions that a science group interacts
with
3. Instrumentation
• Local and remote scientific instruments and facilities.
4. Process of Science
• Explain ‘a day in the life’ of the science group
• Should tie together the instruments, the people, and the
resources
22. And Also More Technical Aspects…
5. Software Infrastructure
6. Network and Data Architecture
7. Cloud Services
8. Outstanding Issues and Pain Points
Local and regional IT staff are critical for this information, and help
form valuable partnerships that may not exist or could use
strengthening
23. Deep Dive: Outputs
• Identify and analyze technical gaps/bottlenecks or
opportunities
• Forecast technology/network capacity needs, particularly
in regions where a site is anticipating increases or
decreases in data l
• Help inform investments in network improvements,
bandwidth needs, or other application services
• Create long-term relationships with researchers, IT staff, and administration to provide ongoing consultation and support
24. Deep Dives So Far
• Eighteen Deep Dive reports available at
https://epoc.global/materials
• Four more in various stages
• Prep, meeting, writing
• We get MANY more requests than we can do
• “Train the trainers” work with regional partners is going slowly
25. Data Mobility Exhibition (DME)/Baseline Performance Testing
• One terabyte of data in an hour
• Equivalent to 2.22 Gb/s average
• Achievable for institutions connected at 10G
• How to find out?
• DME has known good endpoints to test against
• Variety of data sizes you can transfer
• Standard Globus set up
• And if you can’t?
• Work with EPOC to find the bottlenecks!
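The 2.22 Gb/s figure follows directly from the target. A quick check, assuming a decimal terabyte (10^12 bytes) and a 10 Gb/s connection:

```python
def required_gbps(bytes_to_move, seconds):
    """Average rate in gigabits per second needed to move the data in the given time."""
    return bytes_to_move * 8 / seconds / 1e9


ONE_TB = 10**12          # decimal terabyte, in bytes (assumption)
ONE_HOUR = 3600          # seconds

rate = required_gbps(ONE_TB, ONE_HOUR)
print(f"{rate:.2f} Gb/s")                       # 2.22 Gb/s
print(f"{rate / 10:.0%} of a 10 Gb/s link")     # 22% of a 10 Gb/s link
```

At under a quarter of a 10G link, the target leaves headroom for other traffic, which is why it is a reasonable baseline for sites connected at that speed.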
27. Training
• Follow-on to OIN (http://oinworkshop.com)
• Reached over 750 people in 3 years
• Hands-on perfSONAR sessions
• Especially for small nodes, includes file transfer tests
• “How to talk to Scientists”
• DMZ/DTN Set Up
• Hard part - shifting to more use, less install
28. Joint work with University of South Carolina
• Two-day online hands-on workshops
• Often joint with a regional network
• Introduction to tools and techniques for the design,
implementation, and monitoring
• Lab exercises on “pods” emulating networks and
tools
• Topics
• Network tools and architecture
• Use of perfSONAR
• BGP attributes and configuration
http://ce.sc.edu/cyberinfra/workshop_2022.html
34. Any Questions on the Rest of EPOC Before I Go into NetSage in Detail?
1. Roadside Assistance and Consulting
2. Application Deep Dives
3. Network Analysis (NetSage)
4. Data Mobility Exhibition/Baseline Testing
5. Training
35. Monitoring using NetSage
• NetSage advanced measurement services for R&E traffic
• Better understanding of current traffic patterns across
instrumented circuits
• Better understanding of large flow sources/sinks
• Performance information for data transfers
• Started as a collaboration between Indiana University, LBNL, and the University of Hawaii at Manoa
• Now all development at TACC
• Backend support/Deployments at both TACC and IU
• 2021: 2,500+ unique users in 85+ different countries
36. NetSage Data Sources
• SNMP data (Passive)
• Basic bandwidth data
• perfSONAR (Active)
• Active tests between sites
• Flow data from routers (Passive)
• Only de-identified data collected by NetSage
• Tstat-based traffic analysis for archives (Passive)
• TCP flow statistics: congestion window size, number of packets retransmitted, etc.
• Also de-identified before being stored
38. Flow Data Collection
• Flow data is redirected to a collection point, de-identified, and then sent to the NetSage archive
• Collection point
• IU collection point - BEING DISCONTINUED
• Docker container on resources at site’s institution
• https://netsage-project.github.io/netsage-pipeline/docs/deploy/docker_install_simple
• Docker container run as a service on an existing server
• Linux or macOS
• Can be anywhere your router has access to across regular IP routing
• If you choose this option, you are responsible for updates, not us
39. NetSage Privacy
• NetSage is committed to privacy, and to preemptively addressing any security or data sharing concerns
• No Personally Identifiable Information (PII) collected
• Remove the last octet from IP address
• Only keep data on flows over 10M for circuits
• 1M for archives
• Data Privacy Policy
• http://www.netsage.global/home/netsage-privacy-policy
• Data Flow Data Retention (De-Identification) Policy
• https://tinyurl.com/netsage-deid
• Prototypes are behind a password until we’re told to make them public
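As an illustration of the "remove the last octet" step, here is a minimal sketch using Python's standard `ipaddress` module. It assumes IPv4 only and assumes that "removing" the octet means zeroing it; the actual NetSage pipeline may represent de-identified addresses differently.

```python
import ipaddress


def deidentify_ipv4(addr):
    """Zero the last octet of an IPv4 address so individual hosts are not identifiable."""
    ip = ipaddress.IPv4Address(addr)  # raises ValueError for anything that is not IPv4
    return str(ipaddress.IPv4Address(int(ip) & 0xFFFFFF00))


# 192.0.2.0/24 is a documentation-only range, used here as example data.
print(deidentify_ipv4("192.0.2.57"))   # 192.0.2.0
```

After this step, all hosts in the same /24 look identical, so flows can still be attributed to an organization's network but not to a specific machine.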
40. NetSage - Built around answering questions
• Answers questions asked by network engineers, network owners, and end-users
• Human-readable summaries and patterns
• Big picture overview helps highlight trends and events that can make in-depth analysis of local data more fruitful
41. Built around answering questions:
Interesting pattern.
What does it mean?
Singapore to Taiwan
via LA?
Why so slow?
42. NetSage Focus on Use Cases and Questions
• Flow Data Dashboards
• What are the top sites using my circuits?
• What are the top sources/destinations for an organization?
• Who’s using my archive?
• Debugging dashboards
• What are the flows like between these two orgs?
• There was a performance spike on my circuit – what was
it?
• Who’s transferring a lot of data really slowly?
• If SNMP data is available, Bandwidth Dashboard:
• How much are the links used?
• Where are congestion points?
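A question like "who's transferring a lot of data really slowly?" reduces to filtering flow records by size and average rate. The sketch below is only illustrative: the `Flow` record, its field names, and the thresholds are hypothetical and are not the NetSage schema.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    # Hypothetical flow record; field names are assumptions for this example.
    src_org: str
    dst_org: str
    num_bytes: int
    duration_s: float

    @property
    def rate_mbps(self):
        """Average rate of the flow in megabits per second."""
        return self.num_bytes * 8 / self.duration_s / 1e6


def large_slow_flows(flows, min_bytes, max_mbps):
    """Flows that moved at least min_bytes at or below max_mbps, largest first."""
    hits = [f for f in flows if f.num_bytes >= min_bytes and f.rate_mbps <= max_mbps]
    return sorted(hits, key=lambda f: f.num_bytes, reverse=True)


flows = [
    Flow("Org A", "Org B", 50 * 10**9, 40_000),  # 50 GB at about 10 Mb/s
    Flow("Org C", "Org D", 50 * 10**9, 400),     # 50 GB at about 1000 Mb/s
    Flow("Org E", "Org F", 10**6, 1),            # small flow, ignored
]

for f in large_slow_flows(flows, min_bytes=10**9, max_mbps=100):
    print(f.src_org, "->", f.dst_org, f"{f.rate_mbps:.0f} Mb/s")
```

The dashboards answer the same kind of question interactively; a filter like this is mainly useful for thinking about which thresholds are meaningful for your own circuits.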
45. NetSage and ACCESS
• We’re part of the Measurement and Monitoring Service (Track 4)
• You may know this as XDMoD
• We’ll get NetSage data at the edge of each Resource Provider
• Look at flows between sites
• Work with Operations to improve them
47. General walk through
• Flow data https://cenic.netsage.global/grafana/d/xk26IFhmk/flow-data-for-circuits?orgId=2
• SDSC Flows https://cenic.netsage.global/grafana/d/QfzDJKhik/flow-data-per-organization?orgId=2&var-Organization=San%20Diego%20Supercomputer%20Center&var-Sensors=All&var-country_scope=All&var-is_net_test=yes
• SDSC to Continuous Electron Beam Accelerator Facility https://cenic.netsage.global/grafana/d/-l3_u8nWk/individual-flows?var-src=San%20Diego%20Supercomputer%20Center&var-dest=Continuous%20Electron%20Beam%20Accelerator%20Facility&from=1675170997253&to=1675775797253&orgId=2
• Zoom in on one flow https://cenic.netsage.global/grafana/d/nzuMyBcGk/flow-information?var-flow=a0abc1b26fc211784bc6b108983f4afe0fe2a5a4787f64599659d7b273cf8a9a&from=1675170997253&to=1675775797253&var-timestamp=1675345477483&orgId=2
• Flow data summary info https://cenic.netsage.global/grafana/d/CJC1FFhmz/other-flow-stats?orgId=2
48. SDSC to MREN
• https://cenic.netsage.global/grafana/d/-l3_u8nWk/individual-flows?var-src=San%20Diego%20Supercomputer%20Center&from=1667275200000&to=1675227599000&orgId=2&var-dest=Metropolitan%20Research%20and%20Education%20Network&var-subnet=&var-sensors=All&var-country_scope=All&var-is_net_test=yes
• Port 5201 https://www.speedguide.net/port.php?port=5201
49. Flows by Country
• https://cenic.netsage.global/grafana/d/fgrOzz_mk/flow-data-per-country?orgId=2&from=1667275200000&to=1675227599000
• South Korea example
• https://cenic.netsage.global/grafana/d/80IVUboZk/individual-flows-per-country?var-dest=South%20Korea&var-src=United%20States&from=1667275200000&to=1675227599000&orgId=2
• December 15 change https://cenic.netsage.global/grafana/d/80IVUboZk/individual-flows-per-country?var-dest=South%20Korea&var-src=United%20States&from=1671091200000&to=1671177599000&orgId=2
• Zoom in by org https://cenic.netsage.global/grafana/d/-l3_u8nWk/individual-flows?orgId=2&var-dest=KISTI&var-subnet=&var-sensors=All&var-country_scope=All&var-is_net_test=yes&var-src=Oak%20Ridge%20National%20Laboratory&from=1671091200000&to=1671177599000
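The `from` and `to` parameters in these dashboard links are absolute times in milliseconds since the Unix epoch. A small sketch, using only the Python standard library, for reading the time window out of a link such as the "December 15 change" one (relative ranges like `now-1y` are not handled here):

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def dashboard_time_range(url):
    """Return (start, end) as UTC datetimes from a link's from/to epoch-millisecond values."""
    query = parse_qs(urlsplit(url).query)
    start_ms = int(query["from"][0])
    end_ms = int(query["to"][0])

    def to_utc(ms):
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    return to_utc(start_ms), to_utc(end_ms)


url = (
    "https://cenic.netsage.global/grafana/d/80IVUboZk/individual-flows-per-country"
    "?var-dest=South%20Korea&var-src=United%20States"
    "&from=1671091200000&to=1671177599000&orgId=2"
)

start, end = dashboard_time_range(url)
print(start.isoformat())  # 2022-12-15T08:00:00+00:00
print(end.isoformat())    # 2022-12-16T07:59:59+00:00
```

The window is exactly one day starting at 08:00 UTC on December 15, 2022, which corresponds to midnight Pacific Standard Time.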
50. Flow data by country 2
• https://cenic.netsage.global/grafana/d/fgrOzz_mk/flow-data-per-country?orgId=2
• South Korea https://cenic.netsage.global/grafana/d/80IVUboZk/individual-flows-per-country?var-dest=South%20Korea&var-src=United%20States&from=1675270917609&to=1675875717609&orgId=2
• Pick first flow – Stanford
• https://cenic.netsage.global/grafana/d/nzuMyBcGk/flow-information?var-flow=ede94327cad9e7ec6fed44c905b8a176e0cc846fcf3e1f9ba73d97adb9cfc684&from=1675270917609&to=1675875717609&var-timestamp=1675498682914&orgId=2
• Flow data for projects – ImageNet
• https://cenic.netsage.global/grafana/d/ie7TeomGz/flow-data-for-projects?orgId=2&var-project_type=ImageNet&var-sensors=All&var-is_net_test=yes
• Check out the map!!
• What is ImageNet? https://www.image-net.org/about.php
51. Top Talkers Over Time
• https://cenic.netsage.global/grafana/d/b35BWxAZz/top-talkers-over-time?orgId=2&from=now-1y&to=now
• NOAA vanishes
• https://cenic.netsage.global/grafana/d/QfzDJKhik/flow-data-per-organization?orgId=2&var-Organization=National%20Oceanic%20and%20Atmospheric%20Administration&var-Sensors=All&var-country_scope=All&var-is_net_test=yes&from=now-1y&to=now
• https://cenic.netsage.global/grafana/d/-l3_u8nWk/individual-flows?var-src=National%20Oceanic%20and%20Atmospheric%20Administration&from=1644241541872&to=1675777541872&orgId=2
• https://cenic.netsage.global/grafana/d/b35BWxAZz/top-talkers-over-time?orgId=2&var-Sensors=All&var-Interval=7d&var-org=National%20Oceanic%20and%20Atmospheric%20Administration&var-num_lines=1&from=1656658800000&to=1664607599000
53. What NetSage Does Best
• Answers questions asked by network engineers and
network owners
• Human-readable summaries and patterns
• Gives people the higher-level pattern so they can narrow down a time frame and then use local tools that have more detail
• Simplifies and makes accessible basic data
54. Takeaways
• EPOC resources are available to anyone and everyone
• NetSage can help you understand data transfers - when your
data touches one of our collection points
• CENIC Dashboard: https://cenic.netsage.global
• Questions? Contact:
• Jennifer Schopf - jms@tacc.utexas.edu
• Jason Zurawski – zurawski@es.net
IRNC NetSage was funded by US NSF award #1540933
EPOC is funded by US NSF award #1826994