AsyncIO can speed up web scraping by allowing asynchronous requests and parsing to occur concurrently without blocking. It uses coroutines and an event loop to schedule tasks. For scraping, URLs can be fetched asynchronously using aiohttp. Results are gathered after tasks complete without waiting sequentially. Performance can be monitored using tools like aiomonitor which provide a task process list and console. MongoDB can be used to save crawling results with the batch ID and track success/error counts.
7. Sync vs. Async
Sync: tasks are executed one by one; each process runs only after the
previous process is complete.
Async: tasks start and continue running while execution moves on to the
next task. Async tasks do not block the operation and usually run in the
background.
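The difference above can be sketched with asyncio itself: three one-second I/O waits take about three seconds sequentially, but about one second when run concurrently. This is an illustrative example (not from the slides), using the modern asyncio.run (Python 3.7+) rather than the loop-based style shown later.

```python
import asyncio
import time

async def wait_one():
    # Non-blocking one-second wait; control returns to the event loop.
    await asyncio.sleep(1)

async def main():
    start = time.monotonic()
    # All three waits run concurrently instead of one after another.
    await asyncio.gather(wait_one(), wait_one(), wait_one())
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"3 concurrent 1-second waits took {elapsed:.1f}s")
```

With blocking time.sleep(1) calls instead, the same three waits would take about three seconds.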
9. Parallelism vs. Concurrency
Parallelism: making progress in parallel; executing more than one task
at the same time (e.g. on multiple cores).
Concurrency: more than one task in progress at the same time, sharing
time slices of the same core.
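The concurrency half of this distinction can be seen directly: two coroutines on a single thread interleave their steps because each `await` hands control back to the event loop, even though nothing runs in parallel. A small made-up sketch:

```python
import asyncio

order = []

async def worker(name):
    for step in range(2):
        order.append(f"{name}{step}")
        await asyncio.sleep(0)  # yield control so the other coroutine can run

async def main():
    # Both workers share one thread; their steps interleave.
    await asyncio.gather(worker("a"), worker("b"))

asyncio.run(main())
print(order)  # steps from a and b alternate
```

Parallelism, by contrast, would require multiple processes or threads actually executing at the same instant.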
15. AsyncIO
Coroutine: like a generator, but a coroutine can also consume data. A
coroutine can pause and resume: it is a way of pausing a function and
returning a series of values periodically.
Future / Task: a Future is an object that is supposed to have a result
in the future. A Task is a subclass of Future that wraps a coroutine;
when the coroutine finishes, the result of the Task is realized.
Event Loop: the central executor in AsyncIO; it schedules coroutines and
runs them as tasks.
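The three pieces fit together as follows; this minimal sketch uses a made-up coroutine (`compute`) rather than anything from the scraper:

```python
import asyncio

async def compute(x):
    # A coroutine: it can pause here and resume later.
    await asyncio.sleep(0)
    return x * 2

async def main():
    # Wrap the coroutine in a Task; the event loop now drives it.
    task = asyncio.ensure_future(compute(21))
    # When the coroutine finishes, the Task's result is realized.
    return await task

# The event loop runs main() to completion.
print(asyncio.run(main()))  # 42
```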
23. Coroutine
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Read the body while the response is still open.
            html = await response.read()
            print("{}, length: {}".format(url, len(html)))
            return {
                'resp': response,
                'html': html,
                'url': url,
            }
24. ensure_future
Schedule the execution of a coroutine object:
wrap it in a future. Return a Task object.
futures = [asyncio.ensure_future(fetch(url)) for url in urls]
25. Results
results = []
for url, future in zip(urls, futures):  # pair each future with its URL
    try:
        resp = await future
        results.append(resp)
    except Exception as error:
        print(url, error)
        continue
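An alternative to awaiting each future in a loop is asyncio.gather with return_exceptions=True, which collects all results in one await and returns exceptions as values instead of raising them. A sketch with stand-in coroutines in place of fetch():

```python
import asyncio

async def ok(url):
    # Stand-in for a successful fetch.
    return {'url': url}

async def bad(url):
    # Stand-in for a failed fetch.
    raise ValueError(f"failed: {url}")

async def main():
    # return_exceptions=True: errors come back as values, in order.
    outcomes = await asyncio.gather(ok("a"), bad("b"), ok("c"),
                                    return_exceptions=True)
    results = [o for o in outcomes if not isinstance(o, Exception)]
    errors = [o for o in outcomes if isinstance(o, Exception)]
    return results, errors

results, errors = asyncio.run(main())
print(len(results), len(errors))  # 2 1
```

This maps naturally onto the flow later in the deck: results go to the news collection, exceptions to the error collection.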
29. aiomonitor
import asyncio
import aiomonitor

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    with aiomonitor.start_monitor(loop=loop):
        loop.run_until_complete(main(loop))
aiomonitor is a Python 3.5+ module that adds monitoring and CLI
capabilities to asyncio applications. The idea and code were borrowed
from the curio project. The task monitor runs concurrently with the
asyncio loop (or its fast drop-in replacement, uvloop) in a separate
thread; as a result, the monitor works even if the event loop is
blocked for some reason.
https://github.com/aio-libs/aiomonitor
35. Flow
Using MongoDB to save crawling results:
Insert into the batch collection and get a batch ID
Start the tasks
List all pending tasks
If there is any error, save it to the error collection
Iterate over the finished tasks to get each result and save it to the
news collection
Update the batch with the number of success and error results
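The bookkeeping in this flow can be sketched without a live MongoDB by using plain lists as stand-ins for the batch / error / news collections. The field names here are assumptions for illustration, not the deck's actual schema:

```python
import uuid

# Stand-ins for the MongoDB collections named in the flow.
batch_col, error_col, news_col = [], [], []

def run_batch(outcomes):
    """`outcomes` maps url -> result dict or Exception (one per task)."""
    # Insert into the batch collection and get a batch ID.
    batch_id = str(uuid.uuid4())
    batch_col.append({'_id': batch_id, 'status': 'running'})
    success = errors = 0
    for url, outcome in outcomes.items():
        if isinstance(outcome, Exception):
            # Errors go to the error collection.
            error_col.append({'batch_id': batch_id, 'url': url,
                              'error': str(outcome)})
            errors += 1
        else:
            # Results go to the news collection.
            news_col.append({'batch_id': batch_id, 'url': url,
                             'result': outcome})
            success += 1
    # Update the batch with the success and error counts.
    batch_col[-1].update(status='done', success=success, errors=errors)
    return batch_id

run_batch({'a': {'html': '...'}, 'b': RuntimeError('timeout')})
print(batch_col[-1]['success'], batch_col[-1]['errors'])  # 1 1
```

With a real database, each list append would become an insert on the corresponding pymongo collection and the final update a MongoDB update on the batch document.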
36. List Pending Tasks
task_ids = {}
for task in sorted(asyncio.Task.all_tasks(loop=loop), key=id):
t = str(task)
taskid = str(id(task))
task_ids[taskid] = task._state
Save Results
tasks = asyncio.Task.all_tasks(loop=loop)
for task in tasks:
try:
if 'result' in dir(task):
task_result = {
'id': id(task),
'batch_id': batch_id,
'before': task_ids[str(id(task))],
'after': task._state,
'url': task.result()['url']
}
db.news.insert(task_result)
except:
pass