U-SQL Reading & Writing Files (SQLBits 2016)

U-SQL Reading & Writing Files (SQLBits 2016 ADL/U-SQL Pre-Conference)

  1. U-SQL Reading & Writing Files
     Michael Rys, Principal Program Manager, Big Data @ Microsoft
     @MikeDoesBigData, {mrys, usql}@microsoft.com
  2. EXTRACT Expression
         @s = EXTRACT a string, b int
              FROM "filepath/file.csv"
              USING Extractors.Csv(encoding: Encoding.Unicode);
     • Built-in Extractors: Csv, Tsv, Text with lots of options
     • Custom Extractors: e.g., JSON, XML, etc.
     OUTPUT Expression
         OUTPUT @s
         TO "filepath/file.csv"
         USING Outputters.Csv();
     • Built-in Outputters: Csv, Tsv, Text
     • Custom Outputters: e.g., JSON, XML, etc.
     Filepath URIs
     • Relative URI to default ADL Storage account: "filepath/file.csv"
     • Absolute URIs:
       • ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
       • WASB: "wasb://container@account/filepath/file.csv"
  3. Built-In Extractors and Outputters
     • Extractors.Csv(), Extractors.Tsv(), Extractors.Text()
     • Outputters.Csv(), Outputters.Tsv(), Outputters.Text()
     Parallel Execution Extractors
     • Every file is stored in extents of about 250MB
     • One Extract Vertex gets 4 extract processes, each working on one extent
     • Today:
       • Upload data as row-oriented files
       • Use CR/LF as row delimiters
       • This will align row boundaries to extent boundaries
       • Otherwise: you can get data corruption or errors
     Parallel Outputters
     • Writes parallel extents
     • Supports ORDER BY
     • Stitching of extents to files
       • Metadata operation for adl:// files
       • Expensive copy operation for wasb:// files!!!
     Limits
     • Row size: 4MB
     • String column: 128kB; byte[]: up to 4MB
     • SQL.MAP, SQL.ARRAY not supported (transform needed)
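The limits on this slide can be checked before upload. The following Python sketch is one hypothetical way to do that; it is not part of U-SQL or any Azure tooling. It assumes that the limits are measured in bytes of the stored file, that MB and kB are binary units, and that fields contain no quoted or escaped delimiters. The function name `find_oversized` is made up for illustration.

```python
# Pre-upload check against the limits quoted on the slide.
# Assumptions (not from the slide): limits count raw bytes, MB/kB are
# binary units, and fields contain no quoted or escaped delimiters.
MAX_ROW_BYTES = 4 * 1024 * 1024    # row size: 4MB
MAX_STRING_BYTES = 128 * 1024      # string column: 128kB


def find_oversized(data, column_delimiter=b",", row_delimiter=b"\r\n"):
    """Return (row_number, kind) pairs for rows or fields above the limits."""
    problems = []
    rows = data.split(row_delimiter)
    if rows and rows[-1] == b"":
        rows.pop()  # drop the empty piece after a trailing row delimiter
    for number, row in enumerate(rows, start=1):
        if len(row) > MAX_ROW_BYTES:
            problems.append((number, "row"))
        for field in row.split(column_delimiter):
            if len(field) > MAX_STRING_BYTES:
                problems.append((number, "field"))
                break  # one report per row is enough for a pre-check
    return problems


if __name__ == "__main__":
    sample = b"a,b\r\n" + b"x" * (MAX_STRING_BYTES + 1) + b",1\r\n"
    print(find_oversized(sample))
```

Splitting on CR/LF mirrors the slide's advice to use CR/LF as the row delimiter; a file with a different delimiter would need the `row_delimiter` argument changed.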
  4. • delimiter: column delimiter (char; Text() only)
     • encoding: file encoding (System.Text.Encoding)
       • Encoding.[ASCII] (7-bit)
       • Encoding.BigEndianUnicode
       • Encoding.Unicode
       • Encoding.UTF7
       • Encoding.UTF8 (this is the default)
       • Encoding.UTF32
       • CAVEAT: No ANSI support yet!
     • escapeCharacter: escaping of delimiters (including CR/LF)
     • nullEscape: allows a surrogate for the null value
     • quoting: quoted columns using "
       • Default is on
       • Does NOT guard the row delimiter!!! (use escapeCharacter)
     • rowDelimiter: row delimiter
       • Default: CR LF
     • silent: skips rows with an invalid number of columns and turns data type conversion errors into nulls (Extractors only)
       • CAVEAT: Does not skip encoding errors
  5. E_RUNTIME_USER_EXTRACT_INVALID_CHARACTER
     Invalid character for UTF8 encoding in input stream.
     Message: Invalid character for UTF8 encoding in input record at around line 0
     Resolution: Correct the invalid character in the input file or correct encoding in extractor and try again.
     Details: 0xFF 0xFE 0x31 0x0 0x9 0x0 0x4D 0x0
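The bytes in the Details line are worth decoding by hand: 0xFF 0xFE is the UTF-16 little-endian byte order mark, and the rest is the text "1", tab, "M" in UTF-16. So the file is most likely UTF-16 being read with the default UTF-8 encoding, which is the case Encoding.Unicode on the extractor covers. The Python sketch below only illustrates that diagnosis outside U-SQL; the helper name `try_decode` is made up for this example.

```python
import codecs

# The eight bytes from the "Details" line of the error message.
raw = bytes([0xFF, 0xFE, 0x31, 0x00, 0x09, 0x00, 0x4D, 0x00])


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xFF can never appear in valid UTF-8, so this attempt fails.
utf8_result = try_decode(raw, "utf-8")

# The "utf-16" codec reads the byte order mark and decodes the rest.
utf16_result = try_decode(raw, "utf-16")

# The first two bytes are the UTF-16 little-endian byte order mark.
has_utf16_le_bom = raw.startswith(codecs.BOM_UTF16_LE)

if __name__ == "__main__":
    print(repr(utf8_result), repr(utf16_result), has_utf16_le_bom)
```

In .NET, Encoding.Unicode is UTF-16 little-endian, so it matches the byte order mark seen here.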
  6. Simple pattern language on filename and path
         @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
     • Binds two columns, date and suffix
     • Wildcards the filename
     • Today: limits on number of files (between 800 and 3000)
     Virtual columns
         EXTRACT name string
               , suffix string  // virtual column
               , date DateTime  // virtual column
         FROM @pattern
         USING Extractors.Csv();
     • Refer to virtual columns in the query to get partition elimination
     • Virtual columns need to be referenced for DateTime columns and if no wildcard has been given
     OUTPUT
         OUTPUT @rs
         TO "/output/file_{*}.csv"
         USING Outputters.Csv();
     • One file per outputter invocation; {*} is a unique GUID
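To make the binding of virtual columns concrete, here is a small Python sketch that imitates the pattern from this slide with a regular expression. It illustrates the idea only and is not how U-SQL implements file sets. The function names are hypothetical, and the sketch assumes that only the yyyy, MM and dd date parts appear and that a date column always has all three.

```python
import re
from datetime import datetime

# Date format parts supported by this sketch (an assumption; the slide
# only shows yyyy, MM and dd).
DATE_PARTS = {"yyyy": r"\d{4}", "MM": r"\d{2}", "dd": r"\d{2}"}


def compile_pattern(pattern):
    """Translate a file set pattern such as '/in/{date:yyyy}/{*}.{suffix}' to a regex."""
    regex = ""
    pos = 0
    for token in re.finditer(r"\{([^{}]+)\}", pattern):
        regex += re.escape(pattern[pos:token.start()])
        body = token.group(1)
        if body == "*":
            regex += r"[^/]*"                      # wildcard, binds no column
        elif ":" in body:
            name, part = body.split(":", 1)        # e.g. date:yyyy
            regex += "(?P<%s__%s>%s)" % (name, part, DATE_PARTS[part])
        else:
            regex += "(?P<%s>[^/]*)" % body        # plain virtual column
        pos = token.end()
    regex += re.escape(pattern[pos:])
    return re.compile("^" + regex + "$")


def bind_virtual_columns(pattern, path):
    """Return the virtual column values for path, or None if it does not match."""
    match = compile_pattern(pattern).match(path)
    if match is None:
        return None
    columns, dates = {}, {}
    for key, value in match.groupdict().items():
        name, sep, part = key.partition("__")
        if sep:
            dates.setdefault(name, {})[part] = int(value)
        else:
            columns[name] = value
    for name, parts in dates.items():
        # Assumes yyyy, MM and dd were all present in the pattern.
        columns[name] = datetime(parts["yyyy"], parts["MM"], parts["dd"])
    return columns


PATTERN = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"

if __name__ == "__main__":
    print(bind_virtual_columns(PATTERN, "/input/2016/05/04/drivers.csv"))
```

Partition elimination follows the same idea: a predicate on date or suffix can be evaluated against the bound values of each path, so files that cannot satisfy it never need to be read.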
  7. Additional Resources
     Documentation
     • Built-in Extractors: https://msdn.microsoft.com/en-us/library/azure/mt621366.aspx
     • Built-in Outputters: https://msdn.microsoft.com/en-us/library/azure/mt621345.aspx
     • FileSet: https://msdn.microsoft.com/en-us/library/azure/mt621294.aspx
     Sample Data
     • https://github.com/Azure/usql/blob/master/Examples/Samples/Data/AmbulanceData/Drivers.txt
     Sample Project
     • https://github.com/Azure/usql/tree/master/Examples/Builtin-UDOs/
  8. http://aka.ms/AzureDataLake