SlideShare uma empresa Scribd logo
1 de 14
Avro2TF: A Data Processing
Engine for TensorFlow
Xuhong Zhang, Wensheng Sun, Chenya Zhang, Yiming Ma
AI Computing Foundation Team
Tensor Data Preparation: Avro2TF
- A data preprocessing component under TensorFlowIn.
- Read raw user input data with any format supported by Spark.
Generate tensor metadata (e.g.shape, cardinality, dtype).
Generate Avro- or TFRecord-based training data.
Make your training data ready to be consumed by TF!
2
TensorFlowIn Overview
3
Model Development Pipeline
Avro2TF Overview
6
Tensorize
Text
Feature
Categorical
Feature
Tensor Data Preparation: Avro2TF
Deep dive into the user config.
- Example of a tensorizeIn config.
7
columnConfig for NTV only
Array of word tokens
If need Id conversion
Tensor Data Preparation: Avro2TF
Deep dive into the user config.
Shape.
- []: scalar;
- [-1] : 1D array of any length;
- [6]: 1D array of length 6;
- [2, 3]: a matrix with 2 rows and 3 columns.
(Where is “shape” used?)
Notice:
We don’t do any reshaping here. The shape is just the shape of your raw data.
8
Tensor Data Preparation: Avro2TF
Deep dive into the user config.
- For categorical/sparse features, we require them to be represented in NTV
(name-term-value) format.
9
- We also support the following primitive types:
int, long, float, double, String, bytes (for multimedia data such as image, audio, and video), boolean,
enum, Array[NTV].
Tensor Data Preparation: Avro2TF
Deep dive into the tensorize process.
Feature Mapping Table Generation.
- Requirement:
String-based indices Numerical Id-based indices
- Usage:
1) Mapping table is limited (only based on training data). [Starting from 0]
2) If not available maps to same Id with UNK (unknown) token. [= Cardinality]
3) User can provide their own mapping table.
4) Different training can use the same mapping table.
10
Talk later!
Tensor Data Preparation: Avro2TF
Deep dive into the tensorize process.
Feature Mapping Table Generation.
- Feature: NTV format. - Feature: A list of String.
Each row = name + term Each row = single word
11
Tensor Data Preparation: Avro2TF
Deep dive into the tensor metadata.
Cardinality Computation:
- sparseVector: the number of unique name + term across all records .
- String: the number of unique Strings across all records.
- long or int: the maximum long/int value.
12
Tensor Data Preparation: Avro2TF
Deep dive into the output tensor.
- For categorical/sparse features, we provide a special data type: sparseVector.
Indices: All the ids will be put into an array.
Values: All the values in the NTV tuples will be put into an array.
- We also support the following output tensor data type:
int, long, float, double, String, boolean, bytes, sparseVector.
13
14
More Examples on Supported Data Types
Tensor Metadata
Avro2TF

Mais conteúdo relacionado

Mais procurados

Base64 Encoder And Decoder Transformer With Mule ESB
Base64 Encoder And Decoder Transformer With Mule ESBBase64 Encoder And Decoder Transformer With Mule ESB
Base64 Encoder And Decoder Transformer With Mule ESBJitendra Bafna
 
Python command line_14_12_2020
Python command line_14_12_2020Python command line_14_12_2020
Python command line_14_12_2020Sugnan M
 
Tools for reading papers
Tools for reading papersTools for reading papers
Tools for reading papersJack Fox
 
Understanding the components of standard template library
Understanding the components of standard template libraryUnderstanding the components of standard template library
Understanding the components of standard template libraryRahul Sharma
 
12. standard library introduction
12. standard library introduction12. standard library introduction
12. standard library introductionVahid Heidari
 
Introduction to programming in MATLAB
Introduction to programming in MATLABIntroduction to programming in MATLAB
Introduction to programming in MATLABmustafa_92
 
32sql server
32sql server32sql server
32sql serverSireesh K
 
Python advanced 3.the python std lib by example –data structures
Python advanced 3.the python std lib by example –data structuresPython advanced 3.the python std lib by example –data structures
Python advanced 3.the python std lib by example –data structuresJohn(Qiang) Zhang
 
R data-structures-3
R data-structures-3R data-structures-3
R data-structures-3Victor Ordu
 
(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1Serhii Havrylov
 
Python advanced 3.the python std lib by example – algorithm
Python advanced 3.the python std lib by example – algorithmPython advanced 3.the python std lib by example – algorithm
Python advanced 3.the python std lib by example – algorithmJohn(Qiang) Zhang
 

Mais procurados (20)

Base64 Encoder And Decoder Transformer With Mule ESB
Base64 Encoder And Decoder Transformer With Mule ESBBase64 Encoder And Decoder Transformer With Mule ESB
Base64 Encoder And Decoder Transformer With Mule ESB
 
Standard Library Functions
Standard Library FunctionsStandard Library Functions
Standard Library Functions
 
Network Security Lec4
Network Security Lec4Network Security Lec4
Network Security Lec4
 
2CPP16 - STL
2CPP16 - STL2CPP16 - STL
2CPP16 - STL
 
Python command line_14_12_2020
Python command line_14_12_2020Python command line_14_12_2020
Python command line_14_12_2020
 
Msc 1
Msc 1Msc 1
Msc 1
 
Tools for reading papers
Tools for reading papersTools for reading papers
Tools for reading papers
 
Link List
Link ListLink List
Link List
 
Understanding the components of standard template library
Understanding the components of standard template libraryUnderstanding the components of standard template library
Understanding the components of standard template library
 
30csharp
30csharp30csharp
30csharp
 
12. standard library introduction
12. standard library introduction12. standard library introduction
12. standard library introduction
 
Editors l21 l24
Editors l21 l24Editors l21 l24
Editors l21 l24
 
Lecture1
Lecture1Lecture1
Lecture1
 
Introduction to programming in MATLAB
Introduction to programming in MATLABIntroduction to programming in MATLAB
Introduction to programming in MATLAB
 
32sql server
32sql server32sql server
32sql server
 
Python advanced 3.the python std lib by example –data structures
Python advanced 3.the python std lib by example –data structuresPython advanced 3.the python std lib by example –data structures
Python advanced 3.the python std lib by example –data structures
 
R data-structures-3
R data-structures-3R data-structures-3
R data-structures-3
 
Functions class11 cbse_notes
Functions class11 cbse_notesFunctions class11 cbse_notes
Functions class11 cbse_notes
 
(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1
 
Python advanced 3.the python std lib by example – algorithm
Python advanced 3.the python std lib by example – algorithmPython advanced 3.the python std lib by example – algorithm
Python advanced 3.the python std lib by example – algorithm
 

Semelhante a Avro2 tf: a data processing engine for tensorflow

Sqlmaterial 120414024230-phpapp01
Sqlmaterial 120414024230-phpapp01Sqlmaterial 120414024230-phpapp01
Sqlmaterial 120414024230-phpapp01Lalit009kumar
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Learn about Tensorflow for Deep Learning now! Part 1
Learn about Tensorflow for Deep Learning now! Part 1Learn about Tensorflow for Deep Learning now! Part 1
Learn about Tensorflow for Deep Learning now! Part 1Tyrone Systems
 
Introduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep LearningIntroduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep Learningali alemi
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
 
Tamir Dresher - DotNet 7 What's new.pptx
Tamir Dresher - DotNet 7 What's new.pptxTamir Dresher - DotNet 7 What's new.pptx
Tamir Dresher - DotNet 7 What's new.pptxTamir Dresher
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Mod01 tns e overview
Mod01 tns e overviewMod01 tns e overview
Mod01 tns e overviewPeter Haase
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!Linaro
 
running stable diffusion on android
running stable diffusion on androidrunning stable diffusion on android
running stable diffusion on androidKoan-Sin Tan
 
OSA Con 2022: Streaming Data Made Easy
OSA Con 2022:  Streaming Data Made EasyOSA Con 2022:  Streaming Data Made Easy
OSA Con 2022: Streaming Data Made EasyTimothy Spann
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...Altinity Ltd
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Tyrone Systems
 

Semelhante a Avro2 tf: a data processing engine for tensorflow (20)

Sqlmaterial 120414024230-phpapp01
Sqlmaterial 120414024230-phpapp01Sqlmaterial 120414024230-phpapp01
Sqlmaterial 120414024230-phpapp01
 
Just entity framework
Just entity frameworkJust entity framework
Just entity framework
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Learn about Tensorflow for Deep Learning now! Part 1
Learn about Tensorflow for Deep Learning now! Part 1Learn about Tensorflow for Deep Learning now! Part 1
Learn about Tensorflow for Deep Learning now! Part 1
 
Introduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep LearningIntroduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep Learning
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Arduino reference
Arduino referenceArduino reference
Arduino reference
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Theory1&2
Theory1&2Theory1&2
Theory1&2
 
Tamir Dresher - DotNet 7 What's new.pptx
Tamir Dresher - DotNet 7 What's new.pptxTamir Dresher - DotNet 7 What's new.pptx
Tamir Dresher - DotNet 7 What's new.pptx
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Terraform training 🎒 - Basic
Terraform training 🎒 - BasicTerraform training 🎒 - Basic
Terraform training 🎒 - Basic
 
Interm codegen
Interm codegenInterm codegen
Interm codegen
 
Mod01 tns e overview
Mod01 tns e overviewMod01 tns e overview
Mod01 tns e overview
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!
 
running stable diffusion on android
running stable diffusion on androidrunning stable diffusion on android
running stable diffusion on android
 
OSA Con 2022: Streaming Data Made Easy
OSA Con 2022:  Streaming Data Made EasyOSA Con 2022:  Streaming Data Made Easy
OSA Con 2022: Streaming Data Made Easy
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
 
Keras and TensorFlow
Keras and TensorFlowKeras and TensorFlow
Keras and TensorFlow
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
 

Último

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 

Último (20)

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 

Avro2 tf: a data processing engine for tensorflow

  • 1. Avro2TF: A Data Processing Engine for TensorFlow Xuhong Zhang, Wensheng Sun, Chenya Zhang, Yiming Ma AI Computing Foundation Team
  • 2. Tensor Data Preparation: Avro2TF - A data preprocessing component under TensorFlowIn. - Read raw user input data with any format supported by Spark. Generate tensor metadata (e.g.shape, cardinality, dtype). Generate Avro- or TFRecord-based training data. Make your training data ready to be consumed by TF! 2
  • 7. Tensor Data Preparation: Avro2TF Deep dive into the user config. - Example of a tensorizeIn config. 7 columnConfig for NTV only Array of word tokens If need Id conversion
  • 8. Tensor Data Preparation: Avro2TF Deep dive into the user config. Shape. - []: scalar; - [-1] : 1D array of any length; - [6]: 1D array of length 6; - [2, 3]: a matrix with 2 rows and 3 columns. (Where is “shape” used?) Notice: We don’t do any reshaping here. The shape is just the shape of your raw data. 8
  • 9. Tensor Data Preparation: Avro2TF Deep dive into the user config. - For categorical/sparse features, we require them to be represented in NTV (name-term-value) format. 9 - We also support the following primitive types: int, long, float, double, String, bytes (for multimedia data such as image, audio, and video), boolean, enum, Array[NTV].
  • 10. Tensor Data Preparation: Avro2TF Deep dive into the tensorize process. Feature Mapping Table Generation. - Requirement: String-based indices Numerical Id-based indices - Usage: 1) Mapping table is limited (only based on training data). [Starting from 0] 2) If not available maps to same Id with UNK (unknown) token. [= Cardinality] 3) User can provide their own mapping table. 4) Different training can use the same mapping table. 10 Talk later!
  • 11. Tensor Data Preparation: Avro2TF Deep dive into the tensorize process. Feature Mapping Table Generation. - Feature: NTV format. - Feature: A list of String. Each row = name + term Each row = single word 11
  • 12. Tensor Data Preparation: Avro2TF Deep dive into the tensor metadata. Cardinality Computation: - sparseVector: the number of unique name + term across all records . - String: the number of unique Strings across all records. - long or int: the maximum long/int value. 12
  • 13. Tensor Data Preparation: Avro2TF Deep dive into the output tensor. - For categorical/sparse features, we provide a special data type: sparseVector. Indices: All the ids will be put into an array. Values: All the values in the NTV tuples will be put into an array. - We also support the following output tensor data type: int, long, float, double, String, boolean, bytes, sparseVector. 13
  • 14. 14 More Examples on Supported Data Types Tensor Metadata Avro2TF