1. Outline
Introduction
Architecture
Configuration examples & Usage
Current status & TODO
PgLoader, the parallel ETL for PostgreSQL
Dimitri Fontaine
October 17, 2008
Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
Table of contents
1 Introduction
pgloader, the what?
2 Architecture
Main components
Parallel Organisation
3 Configuration examples & Usage
4 Current status & TODO
ETL
Definition
An ETL process loads data into the database from a flat file, in three steps:
1 Extract
2 Transform
3 Load
pgloader’s features
PGLoader will:
Load CSV data
Load pretend-to-be CSV data
Continue loading when confronted with errors
Apply user-defined transformations to data, on the fly
Optionally have all your cores participate in the processing
Configuration
We first parse the configuration, which supports a templating system.
Example
[simple]
use_template = simple_tmpl
table = simple
filename = simple/simple.data
columns = a:1, b:3, c:2
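To make the template idea concrete, here is a minimal sketch built on Python's standard configparser module. The contents of the simple_tmpl section, and the rule that a section inherits every option it does not set itself, are assumptions made for the illustration; this is not pgloader's actual parser.

```python
import configparser

# The [simple_tmpl] section is hypothetical; [simple] is the example above.
CONFIG = """
[simple_tmpl]
format = text
field_sep = |

[simple]
use_template = simple_tmpl
table = simple
filename = simple/simple.data
columns = a:1, b:3, c:2
"""

def load_section(parser, name):
    """Return the options of a section, filled in from its template if any."""
    options = dict(parser[name])
    template = options.pop("use_template", None)
    if template is not None:
        # Options set in the section itself win over the template's.
        options = {**load_section(parser, template), **options}
    return options

def parse_columns(spec):
    """Turn 'a:1, b:3, c:2' into {'a': 1, 'b': 3, 'c': 2}."""
    columns = {}
    for item in spec.split(","):
        name, position = item.strip().split(":")
        columns[name] = int(position)
    return columns

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
simple = load_section(parser, "simple")
print(simple["format"], parse_columns(simple["columns"]))
```

Resolving the template at load time keeps the rest of the program unaware of templates: it only ever sees a flat dictionary of options per section.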
Loading: file reading
PGLoader supports many input formats. Even if they all look like
CSV, the hard part is parsing the data:
Read files one line at a time
Parse physical lines into logical lines
Supports several readers
textreader
csvreader
fixedreader
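The step from physical lines to logical lines can be sketched as follows. This assumes a text format where a field may contain a newline and where a logical line is complete once it holds the expected number of field separators; the function name and parameters are hypothetical, and escaping of separators is ignored for brevity.

```python
def logical_lines(physical_lines, field_count, field_sep="|"):
    """Join physical lines until each logical line holds field_count fields."""
    pending = ""
    for line in physical_lines:
        pending += line
        # A logical line with n fields contains n - 1 separators.
        if pending.count(field_sep) >= field_count - 1:
            yield pending.rstrip("\n")
            pending = ""
    if pending:
        # Leftover data that never reached the expected field count.
        yield pending.rstrip("\n")

# The first record spans two physical lines.
data = ["1|first line\n", "second line|x\n", "2|plain|y\n"]
rows = list(logical_lines(data, field_count=3))
print(rows)
```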
Processing lines
Parsing data is the CPU-intensive part of the job. You may even
have to guess where lines begin and end. Then you add:
column restrictions
column reordering
user-defined columns (constants)
user-defined reformatting modules
COPYing to PostgreSQL
This is how we do it:
Python cStringIO buffers
configurable size (copy_every)
using copy_expert() when available (CVS)
dichotomic error search
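The dichotomic error search can be illustrated without a database. The sketch below assumes that copying a batch either loads every row or fails as a whole: when a batch fails, it is split in two and each half is retried, until the offending rows are isolated. fake_copy, the ValueError it raises and the reject list are stand-ins invented for the example; a real loader would also have to deal with the aborted transaction.

```python
def copy_with_bisect(rows, copy, reject):
    """COPY rows in one batch; on failure, split in two and retry each half."""
    if not rows:
        return 0
    try:
        copy(rows)
        return len(rows)
    except ValueError:
        if len(rows) == 1:
            # The search has narrowed down to a single offending row.
            reject.append(rows[0])
            return 0
        middle = len(rows) // 2
        return (copy_with_bisect(rows[:middle], copy, reject)
                + copy_with_bisect(rows[middle:], copy, reject))

loaded = []

def fake_copy(batch):
    """Stand-in for a COPY command: all-or-nothing, rejects non-numeric rows."""
    if any(not row.isdigit() for row in batch):
        raise ValueError("invalid input")
    loaded.extend(batch)

rejected = []
count = copy_with_bisect(["1", "2", "x", "4", "5", "y"], fake_copy, rejected)
print(count, loaded, rejected)
```

With few bad rows in a large buffer, most of the data still goes through in big batches, and only the neighbourhood of each error is retried in small ones.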
Handling of erroneous data input
PGLoader will continue processing your input when it contains
erroneous data.
reject data file
reject log file, containing error messages
error count in the summary
Error logging
PGLoader will continue processing your input when it contains
erroneous data, and will make it so that you know about the
failures.
log file
console log level: client_min_messages
logfile log level: log_min_messages
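Two independent thresholds, one for the console and one for the log file, map naturally onto Python's standard logging module. The sketch below is only an assumption about how such a setup can be wired; the function name, the logger name and the in-memory streams standing in for the console and the log file are all hypothetical.

```python
import io
import logging

def make_logger(console, logfile, client_min_messages, log_min_messages):
    """Send the same messages to two destinations with separate thresholds."""
    logger = logging.getLogger("loader_sketch")
    logger.setLevel(logging.DEBUG)   # let the handlers do the filtering
    logger.handlers.clear()
    logger.propagate = False
    for stream, level in ((console, client_min_messages),
                          (logfile, log_min_messages)):
        handler = logging.StreamHandler(stream)
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

# In-memory streams stand in for the console and the log file.
console, logfile = io.StringIO(), io.StringIO()
log = make_logger(console, logfile, logging.WARNING, logging.INFO)
log.info("COPY 10000 rows")
log.warning("1 row rejected")
print(console.getvalue(), logfile.getvalue(), sep="---\n")
```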
Why go parallel?
Loading is IO bound, not CPU bound, right?
for large disk arrays, not so much
with complex parsing, not so much
with heavy user rewriting, not so much
Ok... How?
multi-threading is easy to start with in Python
then you add in deques and semaphores (critical sections)
and signals
Global Interpreter Lock (GIL)
a fork()-based reimplementation could be of interest
Example
class PGLoader(threading.Thread):
Parallelism choices
Parallel loading was asked for by some hackers; their use cases dictated two
different modes.
The idea is to have a parallel pg_restore testbed, interesting with
large input files (100GB to several TB). PGLoader can’t compete
with plain COPY, because of client-server round trips compared to local file
reading, but with more CPUs feeding the disk array it should
show nice improvements.
Testing and feedback more than welcome!
Round robin reader
A single thread does the parsing for all the content.
N readers are started, each with its own queue that the main reader fills
with the data for the current round; they issue COPY while the main reader
continues parsing.
Example
[rrr]
section_threads = 3
split_file_reading = False
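The round robin organisation can be sketched with the standard threading and queue modules. This is an illustration under assumptions, not pgloader's code: rows are handed out one at a time rather than in batches, appending to a list stands in for issuing COPY, and None is used as an end-of-data sentinel.

```python
import queue
import threading

def round_robin_load(lines, workers=3):
    """One parsing thread deals parsed rows to N worker queues in turn."""
    queues = [queue.Queue() for _ in range(workers)]
    copied = [[] for _ in range(workers)]

    def worker(index):
        # Each worker drains its own queue; appending stands in for COPY.
        while True:
            row = queues[index].get()
            if row is None:        # sentinel: no more data
                break
            copied[index].append(row)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(workers)]
    for thread in threads:
        thread.start()
    # The main thread does all the parsing and hands rows out round robin.
    for number, line in enumerate(lines):
        queues[number % workers].put(line.rstrip("\n").split("|"))
    for q in queues:
        q.put(None)
    for thread in threads:
        thread.join()
    return copied

result = round_robin_load(["1|a\n", "2|b\n", "3|c\n", "4|d\n"], workers=3)
print(result)
```

Because parsing stays in a single thread, this mode helps mostly when the COPY side is the slow part.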
Split file reader
The file is split into N blocks, and there are as many pgloader threads doing
the same job in parallel as there are blocks.
Example
[rrr]
section_threads = 3
split_file_reading = True
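Splitting a file into N blocks only works if every block starts at the beginning of a line. A minimal sketch of that boundary computation follows; the function names are hypothetical, it assumes one record per physical line, and an in-memory stream stands in for the data file.

```python
import io

def split_offsets(stream, blocks):
    """Return (start, end) byte ranges covering stream, cut at line ends."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    bounds = [0]
    for i in range(1, blocks):
        target = max(size * i // blocks, bounds[-1])
        stream.seek(target)
        stream.readline()            # move forward to the next line start
        bounds.append(min(stream.tell(), size))
    bounds.append(size)
    # Drop empty ranges, which appear when blocks outnumber lines.
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]

def read_block(stream, start, end):
    """Read the lines of a single block, as one parallel worker would."""
    stream.seek(start)
    return stream.read(end - start).splitlines()

data = io.BytesIO(b"1|a\n2|b\n3|c\n4|d\n5|e\n")
ranges = split_offsets(data, 3)
print(ranges, [read_block(data, start, end) for start, end in ranges])
```

Each worker would then parse and COPY its own byte range, so both parsing and loading run in parallel.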
Examples
The PGLoader distribution comes with a variety of examples; don’t forget
to have a look at them.
simple
That simple:
Example
[simple]
table = simple
filename = simple/simple.data
format = text
datestyle = dmy
field_sep = |
trailing_sep = True
columns = *
User defined columns
Constant columns added at parsing time.
Use case: adding an origin server id field that depends on the file
being loaded, for data aggregation.
Example
[server_A]
filename = imports/A.csv
columns = b:2, d:1, x:3, y:4
udc_c = A
copy_columns = b, c, d
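How the three options of this example combine can be sketched in a few lines. The sketch assumes that positions in columns are 1-based, that udc_c = A defines a constant column named c, and that copy_columns gives the names and order of the columns sent to COPY; build_row and the sample input are hypothetical.

```python
def build_row(fields, columns, udc, copy_columns):
    """Name the input fields, add constant columns, emit the COPY columns."""
    row = {name: fields[position - 1] for name, position in columns.items()}
    row.update(udc)   # constants are the same for every line of the file
    return [row[name] for name in copy_columns]

columns = {"b": 2, "d": 1, "x": 3, "y": 4}   # columns = b:2, d:1, x:3, y:4
udc = {"c": "A"}                             # udc_c = A
copy_columns = ["b", "c", "d"]               # copy_columns = b, c, d

row = build_row("d1,b1,x1,y1".split(","), columns, udc, copy_columns)
print(row)
```

The input columns x and y are read but not loaded, which is the column restriction case; b and d swap places, which is the reordering case.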
User defined Reformatting modules
The basic idea is to avoid any pre-processing done with another
tool (sed, awk, you name it).
The file has ’12131415’:
Example
[fixed]
table = fixed
format = fixed
filename = fixed/fixed.data
columns = *
fixed_specs = a:0:10, b:10:8, c:18:8, d:26:17
reformat = c:pgtime:time
The file has ’12131415’; we want ’12:13:14.15’.
Example
def time(reject, input):
    """ Reformat str as a PostgreSQL time """
    if len(input) != 8:
        reject.log(mesg, input)
    hour = input[0:2]
    ...
    return '%s:%s:%s.%s' % (hour, min, secs, cents)
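The slide elides the middle of the function. For readers who want something to run, here is one way a complete version might look. The Reject class, the error message and the choice to return None for bad input are assumptions made for the sketch; only the signature and the output format come from the example above.

```python
class Reject:
    """Hypothetical stand-in for the reject object passed to the function."""
    def __init__(self):
        self.entries = []

    def log(self, message, data):
        self.entries.append((message, data))

def time(reject, input):
    """Reformat an 'HHMMSSCC' string as a PostgreSQL time."""
    if len(input) != 8 or not input.isdigit():
        # Assumption: bad input is logged and no value is produced.
        reject.log("time: expected 8 digits", input)
        return None
    hour, mins, secs, cents = (input[i:i + 2] for i in range(0, 8, 2))
    return "%s:%s:%s.%s" % (hour, mins, secs, cents)

reject = Reject()
print(time(reject, "12131415"))
```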
The fine manual says it all
At http://pgloader.projects.postgresql.org/ or man pgloader
Example
> pgloader --help
> pgloader --version
> pgloader -DTsc pgloader.conf
TODO
http://pgloader.projects.postgresql.org/dev/TODO.html
Constraint Exclusion support
Reject Behaviour
XML support with user defined XSLT StyleSheet
Facilities
Don’t be shy and just ask for new features!
Resources and Users
pgfoundry, 1 developer, some users, no mailing list yet (no one
asking for one), occasional mails, rare bug reports (fixed)
Support ongoing at #postgresql and #postgresqlfr
Packages for Debian, FreeBSD, OpenBSD, CentOS, RHEL and
Fedora.