IEEE ICDM 2018 Tutorial on Blockchain Data Analytics


  1. 1. Blockchain Data Analytics Tutorial. Cüneyt Gürcan Akçora, Yulia R. Gel, Murat Kantarcioglu. Joint work with N. C. Abay, Y. Chen, M. Dixon, A. K. Dey, U. Islambekov, Y. Li, E. Smirnova, B. Thuraisingham. Depts. of Computer Science and Math Sciences, University of Texas at Dallas. BlockchainTutorial.Github.io. IEEE ICDM 2018 Blockchain Day, Singapore.
  2. 2. 2 Outline • A brief history of Blockchain • Building blocks of Blockchain • Blockchain data models and structures • TXO and account based blockchains • Privacy and security in blockchains • Financial analytics on blockchains
  3. 3. 1- Blockchain Data Analytics - Core Blockchain. How did blockchains appear? How do they work? What are the design considerations? What data is stored on a blockchain?
  4. 4. Core Blockchain: coin timeline. 10/31/2008: Satoshi Nakamoto posts the Bitcoin white paper, "Bitcoin: A Peer-to-Peer Electronic Cash System", to a forum. 1/3/2009: the first data block of the Bitcoin blockchain is created. Later additions: smart contracts, lightning networks, added privacy. [Figure: coin timeline by Jeff Desjardins, retrieved from VisualCapitalist.com and updated.]
  5. 5. 5 Blockchain Network Every node runs the same software to verify data blocks. Each node is connected to a few other nodes only. New nodes appear and existing ones disappear all the time. There is no trusted node. Every node has the full copy of the data. Goal: Having a single truth about data, that can be verified by everyone.
  6. 6. Bitcoin: A financial application of Blockchain. Ledger: “a book laying or remaining regularly in one place”. Blockchain: a distributed ledger, stored as a chain of data blocks. Bitcoin: the chain data contains financial transactions. [Figure: node a receives two versions of the chain 1-2-3-4 from its peers, one carrying true data and one carrying false data.] Which peer should node a believe about block 4?
  7. 7. Bitcoin. Example transactions: (1) From: Cuneyt. To: Joe (1BTC), Tim (2BTC). Use the 3 bitcoins I received in Block 1 transaction 3. Signed: Cuneyt. (2) From: Jim. To: Chris (2BTC). Use the 2 bitcoins I received in Block 2 transaction 1. Signed: Jim. 1MB block size = ~2K transactions. Two inherent problems: • Authenticity (you really have the funds). • Double spending (you are not using the same funds twice). Authenticity is solved with digital signatures and by showing the proof of funds. Confirmation of payments requires more effort: this is the double spending problem.
  8. 8. Core Blockchain • If everyone can create blocks, the blockchain may never stabilize. [Figure: after blocks 1 to 4, two competing versions of block 5 create Fork 1 and Fork 2.] Fork 1 contains: From: Jim. To: Chris (2BTC). Use the 2 bitcoins I received in Block 2 transaction 1. Signed: Jim. Fork 2 contains: From: Jim. To: John (2BTC). Use the 2 bitcoins I received in Block 2 transaction 1. Signed: Jim. Jim is malicious: he is trying to use the same coins in two payments, hoping that Chris and John will not notice the other payment. 1- If fork 1 becomes the canonical fork, John will be defrauded. 2- If fork 2 becomes the canonical fork, Chris will be defrauded.
  9. 9. Core Blockchain • We cannot have a stable chain if we cannot be certain about blocks. There cannot be multiple long forks with alternative truths. • Solution: Make block creation difficult. Allow the network some time between blocks, so that the current state will be learned by all (or most) nodes. • How can we stop people from creating blocks? Pose a cryptographic puzzle! [Figure: after blocks 1 to 4, Fork 1 and Fork 2 each continue with their own blocks 5 and 6.]
  10. 10. Core Blockchain. Proof-of-Work was first used in email spam detection. Example email: From: Cuneyt. To: Alice. Date: 1/1/2027. …..mail content…. This mail has 35 words. Proof of work: in this simple example, it is counting words. The email service provider runs the following algorithm: if the proof of work is not attached, the email is spam; else count the words. If the word count in the proof of work is wrong, discard the email as spam; else the email might still be spam, so run the spam detector.
  11. 11. 11 Core Blockchain Proof-of-Work: Spending time and effort to create (mine) a block. The idea is to slow down attackers. Bitcoin uses a hash puzzle for Proof of work. Hash(University) = 7FDD903AF601C14E71D4938B2F7AB58A78C03C36D43485BB1937826B90DEFDD0 Hash(Univarsity) = 7E984B4F8807A0092C65AE3D897DD186943D95435C0A56F8350A0C7F82ACEF03 Proof of work: Find a hash value that satisfies a given difficulty.
  12. 12. Core Blockchain. A node chooses to be a miner. [Figure: a miner node receives the transaction: From: Jim. To: Chris (2BTC). Use the 2 bitcoins I received in Block 2 transaction 1. Signed: Jim.]
  13. 13. Mining • Mining is the process of gathering transactions that are waiting in the system, creating a block out of them and advertising it to the other nodes in the system. • Creating a block is the computational review process performed on transactions. • Each block is limited to 1MB, so it can hold only ~2K transactions. A block may also contain a single transaction (as in many earlier blocks). Do you see a possible problem here? • Everyone can create transactions, but only miners can create blocks. (Nuclear scientists were caught running mining software on supercomputers.)
  14. 14. Mining: creating the block • Several issues must be addressed in mining: • Nothing is physical; the coins you spend may be fake (verify the source). • Even when the coins are not fake, you may have already spent them (verify the history). • Is the sender the real owner of these coins, and is the receiver address correct? (verify the users). • Are the input and output amounts correct? (verify the amounts). The miner checks and verifies all of these. There are many nodes, but few miners on most blockchains.
  15. 15. Core Blockchain. hashOfBlock = Hash(block content). For nonce = 1 to infinity: blockHash = Hash([hashOfBlock + hashOfPrevBlock + …] + nonce); if blockHash satisfies the difficulty, the block is mined successfully! • The miner increases the nonce until a useful blockHash is found. • If such a nonce does not exist, the miner can start over by rearranging the transactions in the block.
  16. 16. Core Blockchain • Once the block is mined, the miner broadcasts it to all its peers. The block propagates in the network. • Mining a block does not guarantee that the block will be included in the blockchain. • Other miners need to build their blocks on top of the block. • Colluding miners can ignore a mined block. Furthermore, they can cooperate and build blocks on each other’s blocks only. [Figure: a chain of blocks 1 to 4 followed by competing blocks 5 and 6.]
  17. 17. Core Blockchain. [Figure: blocks 1 to 4, followed at $t_0$ by three competing last blocks mined by Miner 1, Miner 2 and Miner 3.] Miner 4 arrives at $t_1$ to create a block and finds 3 competing last blocks. Depending on which block it builds on, Miner 4 has to exclude transactions that have already been mined.
  18. 18. Core Blockchain. Let’s suppose that Miner 4 chooses to build on the block of Miner 1. • Miner 5 arrives at $t_2$ and sees 3 forks. The logical choice is to build on the longest fork, that of Miners 1 and 4*. • Miner 5 may still choose to build on other forks, which may be a costly mistake. *Both length and difficulty are considered. [Figure: the three competing blocks from $t_0$, with Miner 4's block added on top of Miner 1's block at $t_1$.]
  19. 19. Proof of Work: An example • How difficult is proof of work? Consider "Hello, world!" + nonce. • If the difficulty is three leading zeros (000…), we try 4251 nonce values (0 to 4250). • Hash("Hello, world!0") => 1312af178c253f84028d480a6adc1e25e81caa44c749ec81976192e2ec934c64 • Hash("Hello, world!4250") => 0000c3af42fc31103f1fdc0151fa747ff87349a4714df7cc52ea464e12dcd4e9 Bitcoin uses an adaptive difficulty that changes with how much computing power exists in the mining business.
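The nonce search on this slide can be sketched in a few lines of Python. The slide does not name the hash function; the sketch assumes a single round of SHA-256 over the UTF-8 string, and the helper name `find_nonce` and the `max_tries` cap are illustrative choices, not part of the tutorial.

```python
import hashlib


def find_nonce(base: str, prefix: str, max_tries: int = 1_000_000):
    """Return (nonce, digest) for the first nonce, counting from 0,
    whose SHA-256 hex digest of base + nonce starts with prefix."""
    for nonce in range(max_tries):
        digest = hashlib.sha256(f"{base}{nonce}".encode("utf-8")).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
    raise ValueError("no nonce found within max_tries")


# Difficulty of three leading hex zeros, as on the slide.
nonce, digest = find_nonce("Hello, world!", "000")
print(nonce, digest)
```

Each extra leading hex zero multiplies the expected number of tries by 16, which is why a longer zero prefix corresponds to a higher difficulty.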
  20. 20. Core Blockchain: adjusting difficulty. Hash of Bitcoin Block #3 (January 2009) [8 zeros]: 00000000b3322c8c3ef7d2cf6da009a776e6a99ee65ec5a32f3f345712238473. Hash of Bitcoin Block #350000 (March 2015) [17 zeros]: 0000000000000000053cf64f0400bb38e0c4b3872c38795ddde27acb40a112bb. Hash of Bitcoin Block #547873 (October 2018) [20 zeros]: 0000000000000000000064eb6ef4f94808938de0889695dd7bb8dca70b334cb2. • The desired rate is one block every 10 minutes. This is checked every 2016 blocks (2 weeks). • If 2016 blocks took less than two weeks, the difficulty is increased.
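A minimal sketch of the adjustment rule, assuming the new difficulty scales the old one by the ratio of expected to observed time for the last 2016 blocks. The cap of a factor of 4 per adjustment reflects Bitcoin's known behavior but is not stated on the slide, so it is exposed as the parameter `max_factor`; the function name `retarget` is illustrative.

```python
TARGET_BLOCK_SECONDS = 10 * 60          # desired rate: one block every 10 minutes
RETARGET_INTERVAL = 2016                # blocks between difficulty checks
EXPECTED_SECONDS = TARGET_BLOCK_SECONDS * RETARGET_INTERVAL  # two weeks


def retarget(old_difficulty: float, actual_seconds: float,
             max_factor: float = 4.0) -> float:
    """Scale the difficulty so that the next 2016 blocks take about two weeks.

    max_factor bounds a single adjustment (assumption, see lead-in).
    """
    ratio = EXPECTED_SECONDS / actual_seconds
    ratio = max(1.0 / max_factor, min(max_factor, ratio))
    return old_difficulty * ratio


# Blocks arrived twice as fast as desired: the difficulty doubles.
print(retarget(100.0, EXPECTED_SECONDS / 2))
```

The same rule lowers the difficulty when the last 2016 blocks took longer than two weeks.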
  21. 21. Proof of Work: Bitcoin difficulty in time. [Figure: difficulty on a logarithmic axis (1 to 1E+13) against time (1/27/2009 to 1/27/2018). Data from BTC.com.] Decreases are possible. With the maximum possible difficulty we would need to try more than $10^{77}$ nonce values. Bitcoin: more than $10^{21}$ tries to find a valid nonce!
  22. 22. Incentives for mining: block reward + sum of all transaction fees. • The block reward halves every 4 years. Starting with 50 bitcoins per block, this will create 21M bitcoins in total. • The transaction fee is the amount left unspent between inputs and outputs. • The fee may also be zero, but why would anyone mine your transaction? Example: From: Cuneyt. To: Joe (0.8 BTC), Tim (2 BTC). Use the 3 bitcoins I received in Block 1 transaction 3. Signed: Cuneyt. Transaction fee = 3 − 0.8 − 2 = 0.2 bitcoins.
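The 21M total and the fee arithmetic can be checked with a short sketch. It assumes amounts are tracked in integer satoshis (1 bitcoin = 100,000,000 satoshis) and that a halving period is 210,000 blocks, which is roughly 4 years at one block per 10 minutes; neither detail appears on the slide, and the function names are illustrative.

```python
SATOSHI_PER_BTC = 100_000_000
HALVING_INTERVAL = 210_000   # assumption: blocks per halving period (~4 years)
INITIAL_REWARD = 50 * SATOSHI_PER_BTC


def block_reward(height: int) -> int:
    """Block reward in satoshis at a given block height."""
    halvings = height // HALVING_INTERVAL
    return INITIAL_REWARD >> halvings if halvings < 64 else 0


def total_supply() -> int:
    """Sum the rewards of all halving periods until the reward reaches zero."""
    total, halvings = 0, 0
    while (INITIAL_REWARD >> halvings) > 0:
        total += (INITIAL_REWARD >> halvings) * HALVING_INTERVAL
        halvings += 1
    return total


def transaction_fee(input_amounts, output_amounts) -> int:
    """Fee = value of the inputs left unspent by the outputs."""
    return sum(input_amounts) - sum(output_amounts)


print(total_supply() / SATOSHI_PER_BTC)   # just under 21 million bitcoins
# Slide example: 3 BTC in, 0.8 BTC and 2 BTC out, in satoshis.
print(transaction_fee([300_000_000], [80_000_000, 200_000_000]))
```

Because rewards are rounded down to whole satoshis at each halving, the total stays slightly below 21M.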
  23. 23. • Around May 2020 the block reward will halve to 6.25 bitcoins. • 2140 is the year when the reward will be practically zero. • Transaction fees will carry the system after block rewards become trivial. • November 2018: the block reward is 12.5B (bitcoins), and transaction fees are ~0.05B per block. • Fees are trivial if the market volume is low, because users then attach lower fees. • In December 2017, fees were around 5-7B.
  24. 24. 24 Bitcoin mining • One winning miner every 10 mins. Many others lose and waste electricity. • Eric Jennings: “The cost for having no central authority is the cost of that energy”. • Tim Swanson: “Bitcoin is a peer-to-peer heat engine”. • Narayanan: “Bitcoin mining has been an expensive way to bet that the price of Bitcoin would rise”.
  25. 25. Proof-of-X. Proof-of-X is an umbrella term that covers Proof-of-Work alternatives in block mining. Each alternative scheme expects miners to show a proof that they have done enough work or spent enough wealth before creating the block. • Proof-of-Stake: Stake = Coin × Age. The miner with the highest stake becomes the next miner in the chain. Once coins are used, their age becomes zero. The rich get richer! • Proof-of-Burn: The miner sacrifices wealth by creating a transaction that sends some coins to a “verifiably unspendable” address. This reduces the total supply! • Proof-of-Ownership, Proof-of-Publication, and others…
  26. 26. 26 Blockchains – why stop at cryptocurrencies? Every node runs the same software to verify data blocks. Each node is connected to a few other nodes only. New nodes appear and existing ones disappear all the time. There is no trusted node. Every node has the full copy of the data.
  27. 27. Bitcoin: data are financial transactions. Example: From: Cuneyt. To: Joe (0.8 BTC), Tim (2 BTC). Use the 3 bitcoins I received in Block 1 transaction 3. Signed: Cuneyt. Tschorsch, Florian, and Björn Scheuermann. Bitcoin and beyond: A technical survey on decentralized digital currencies. IEEE Communications Surveys & Tutorials 18, no. 3 (2016): 2084-2123.
  28. 28. Data can be more: - Notary Documents - Pictures - Identity Documents - Shipping logs - Manufacturing logs - IOT data. 1- On-chain storage. 2- Off-chain storage: • Store hashes of data (as proof) • Store the address of data (Our data resides as IP:
  29. 29. Blockchain Network: Beyond Cryptocurrencies • Ethereum was created to store data and software code on a blockchain. • Similar to Bitcoin, Ethereum has a currency: Ether. • The code (a smart contract) is written in Solidity, a language designed for Ethereum, which is compiled to bytecode and executed on the Ethereum Virtual Machine. • An analogy is the MySQL snippets stored in a database.
  30. 30. 31 Blockchain Network – Smart Contract • User creates a transaction to upload the Smart Contract code to an address. • The code at the address is replicated in all blockchain nodes. • In other words, you force other users to store your code. • The code is executed by passing transaction messages to its functions. Execution occurs at all nodes – hence the World Computer! • Contract creation is expensive. • All subsequent calls to the contract code are billed in terms of what operations they require.
  31. 31. 32 Blockchain Network – Contracts • Each operation has a gas price for executing it. • For example, using the ‘addition’ operation costs you 3 gas. Image: https://hackernoon.com/
  32. 32. Ethereum: the World Computer. Benefits of having code on a blockchain: • Public code: code can be analyzed by everyone. • Unmodifiable code: code cannot be modified without leaving a trace. • Unstoppable execution: code will run to completion. • Verifiable results: results can be verified by all parties. It is easy to see why platform creators called the code Smart Contract!
  33. 33. Ethereum: the World Computer • Contracts gave rise to Smart Contract based tokens: exchanged data units that are used to buy/sell services in the real world. • For example, with the Storj token, files are stored on your hard disk and you are paid a fee through Ethereum. • Tokens can be bought or sold; they act as value stores. Token prices are arbitrated in the real world. • Companies create tokens, and sell them in Initial Coin Offerings to raise capital.
  34. 34. 35 Blockchain tokens New Ethereum token contracts in time (>5K transactions in early 2018)
  35. 35. 36 Blockchain tokens Ethereum token transactions in time
  36. 36. Platforms: Standardization Continues • Initially, tokens could implement a vital function (e.g., transfer) with any name (e.g., sell, transferTo, sendTo). • The ERC20 standard enforces a list of functions that must be implemented by a token. [Figure: May 2018. Data from our Chartalist project.]
  37. 37. 38 Blockchain tokens and platforms Left Ethereum
  38. 38. 2- Blockchain Graph Analytics Are transactions the same on all blockchains? How can we model Blockchain data?
  39. 39. 40 Blockchain Graph Analytics • For data modelling, blockchains can be divided into two major categories: Account based blockchains (e.g., Ethereum) Transaction output (TXO) based blockchains (e.g., Bitcoin, Litecoin)
  40. 40. 41 2a - Transaction output based blockchains
  41. 41. Transaction output (TXO) based blockchains. [Figure: in Transaction 1, address a spends 3B; the outputs are 0.8B to one address and 2B to address b, and the remaining 0.2B is the transaction fee. In a second transaction, b spends its 2B and pays 1.5B to c and 0.3B to d.] Next, if address b wants to spend the 2B it received, it needs to show proof of funds: “Use the 2B I received from Block 1, transaction 1 to pay 1.5B to c and 0.3B to d”.
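The proof-of-funds idea can be illustrated with a toy unspent-output set. This is a sketch under simplifying assumptions: signatures are replaced by a plain ownership check, amounts are integers in tenths of a bitcoin to keep the arithmetic exact, and the names `Output`, `apply_transaction`, the transaction ids and the address `x` (the unnamed 0.8B recipient) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    address: str
    amount: int  # tenths of a bitcoin (illustrative unit)


def apply_transaction(utxos, tx_id, input_refs, spender, outputs):
    """Spend the referenced unspent outputs, create new ones, return the fee.

    utxos maps (transaction id, output index) to an unspent Output.
    """
    for ref in input_refs:
        if ref not in utxos:
            raise ValueError(f"{ref} is unknown or already spent")
        if utxos[ref].address != spender:
            raise ValueError(f"{ref} does not belong to {spender}")
    total_in = sum(utxos[ref].amount for ref in input_refs)
    total_out = sum(out.amount for out in outputs)
    if total_out > total_in:
        raise ValueError("outputs exceed inputs")
    for ref in input_refs:            # inputs are consumed entirely
        del utxos[ref]
    for index, out in enumerate(outputs):
        utxos[(tx_id, index)] = out
    return total_in - total_out       # whatever is left unspent is the fee


utxos = {("tx0", 0): Output("a", 30)}                      # a holds 3B
fee1 = apply_transaction(utxos, "tx1", [("tx0", 0)], "a",
                         [Output("x", 8), Output("b", 20)])
fee2 = apply_transaction(utxos, "tx2", [("tx1", 1)], "b",
                         [Output("c", 15), Output("d", 3)])
print(fee1, fee2)
```

Referencing a previous output a second time fails because the entry has been deleted from the set, which is how this model rules out double spending.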
  42. 42. 43 Transaction output (TXO) based blockchains • Genesis block 0: The first block, created by Nakamoto. • Every block has one coinbase transaction that creates bitcoins (sum of block reward + transaction fees). • All other payments must show proof of funds (previous outputs). Coinbase transaction Block n Block n+1 Time
  43. 43. Three Graph Rules for TXO. 1- Source Rule: Coins can be gained from multiple transactions. These can be spent at once or separately (dashed edges connect to unspecified addresses). [Figure: address b with outgoing transactions $tx_1$ and $tx_2$.] Address b can spend its bitcoins at once in $tx_1$, or separately in $tx_1$ and $tx_2$.
  44. 44. Three Graph Rules for TXO. 2- Balance Rule: All coins gained from a transaction must be spent in a single transaction. Addresses cannot keep change; they must forward it. Address reuse is rare. [Figure: address c pays to addresses d and e. Same user?] Two cases: i - c sold all its coins: c, d and e all belong to different people, or ii - c paid d, and forwarded the change to its new address e. In many scenarios, we have to learn which addresses belong to the same entity.
  45. 45. Three Graph Rules for TXO. 3- Mapping Rule: Multiple inputs can be signed separately and merged, but the input-output address mappings are not recorded. A transaction can be considered a lake with incoming rivers and outgoing streams. Coins mix in this lake. [Figure: a transaction with 1B inputs and 1B outputs.] Heuristics have been developed to link inputs to outputs; we will cover them in the privacy section.
  46. 46. Existing Graph Approaches. 1- Transaction graph: Edges between transactions only. It cannot capture unspent coins, and it cannot distinguish transactions with differing inputs/outputs. [Figure: a blockchain graph with different inputs/outputs next to its transaction graph.] Dorit Ron and Adi Shamir. 2013. Quantitative analysis of the full bitcoin transaction graph. In International Conference on Financial Cryptography and Data Security. Springer, 6–24.
  47. 47. Existing Graph Approaches. 2- Address graph: Edges between addresses only. [Figure: a blockchain graph next to its address graph.] • Edges are multiplied between inputs and outputs: this creates 1 million edges for a transaction with 1000 inputs and 1000 outputs. • Creates bias for the average degree, and even for the median degree. Michele Spagnuolo, Federico Maggi, and Stefano Zanero. 2014. Bitiodine: Extracting intelligence from the bitcoin network. In International Conference on Financial Cryptography and Data Security. Springer, 457–4
  48. 48. Existing Graph Approaches. Graph analysis with a single node type is not always useful for the forever forward branching tree of Bitcoin. 2- Address graph: is it worth the trouble of searching for graph motifs? • Addresses are not supposed to re-appear in the future. • Closed triangles are very rare. • Output/input address sets do not have edges to each other. Our tools do not consider this, and search for edges in vain (linked transactions within a block are possible but rare).
  49. 49. 50 Blockchain Graph – Substructure mining Definition [K-Chainlets]: Let k-chainlet Gk = (Vk, Ek, B) be a subgraph of G with k nodes of type {Transaction}. If there exists an isomorphism between Gk and G’, G’ ∈ G, we say that there exists an occurrence, or embedding of Gk in G. If a Gk occurs more/less frequently than expected by chance, it is called a Blockchain k-chainlet. A k-chainlet signature fG(Gk) is the number of occurrences of Gk in G. • Rather than individual edges or nodes, we use a subgraph as the building block in our Bitcoin analysis. • We use the term chainlet to refer to such subgraphs. Cuneyt G. Akcora, Asim Kumer Dey, Yulia R. Gel, and Murat Kantarcioglu. Forecasting Bitcoin Price with Graph Chainlets. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 765- 776. Springer, Cham, 2018.
  50. 50. 51 Blockchain Chainlets • Chainlets have distinct shapes that reflect their role in the network. • We aggregate these roles to analyze network dynamics. Tx 1 Tx 1 Tx 2 Tx 2 Tx 3 Tx 3 Tx 4 Tx 4 Three distinct types of 1-chainlets!
  51. 51. Aggregate Chainlets. $C_{x \to y}$: chainlet with x inputs and y outputs. • Transition (e.g., chainlet $C_{3 \to 3}$): Transition Chainlets imply coins changing address: x = y. • Split (e.g., chainlet $C_{1 \to 2}$): Split Chainlets may imply spending behavior: y > x. However, the community practice against address reuse can also create split chainlets. • Merge (e.g., chainlet $C_{3 \to 1}$): Merge Chainlets imply gathering of funds: x > y.
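The three aggregate types follow directly from comparing the input and output counts. A minimal sketch, where the function name `chainlet_type` and the sample list of (x, y) shapes are illustrative:

```python
from collections import Counter


def chainlet_type(x: int, y: int) -> str:
    """Classify a chainlet with x inputs and y outputs."""
    if x < 1 or y < 1:
        raise ValueError("a chainlet needs at least one input and one output")
    if x == y:
        return "transition"
    return "split" if y > x else "merge"


# Hypothetical (inputs, outputs) shapes of the transactions in one snapshot.
shapes = [(3, 3), (1, 2), (3, 1), (1, 2), (1, 1)]
counts = Counter(chainlet_type(x, y) for x, y in shapes)
print(counts)
```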
  52. 52. Aggregate Chainlets. [Figure: percentage of aggregate chainlets in the Bitcoin Graph (weekly snapshots).] Around here 2 pizzas are worth 10 thousand bitcoins. This is not Italy (il bel paese)!
  53. 53. Chainlet Matrix: representing the network in time • For a given time granularity, such as one day, we take snapshots of the Bitcoin graph. • Chainlet counts obtained from the graph are stored in an N×N matrix. [Figure: an example 3×3 chainlet matrix, indexed by inputs 1 to 3 and outputs 1 to 3, with entries 2, 1, 1, 0, 0, 0, 0, 0, 0.] N: How big should the matrix be?
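  54. 54. Extreme Chainlets • N can reach thousands; the matrix can be 1000 × 1000. • On Bitcoin, 90.50% of the chainlets have x < 5 and y < 5 (N = 5), and 97.57% have x < 20 and y < 20 (N = 20). Extreme chainlets are the last column/row of the chainlet matrix. They imply big coin movements in the graph! [Figure: the example chainlet matrix extended with an extreme row and column.] Occurrence matrix: $$O[i,j] = \begin{cases} \#C_{i \to j} & \text{if } i < N \text{ and } j < N \\ \sum_{z=N}^{\infty} \#C_{i \to z} & \text{if } i < N \text{ and } j = N \\ \sum_{y=N}^{\infty} \#C_{y \to j} & \text{if } i = N \text{ and } j < N \\ \sum_{y=N}^{\infty} \sum_{z=N}^{\infty} \#C_{y \to z} & \text{if } i = N \text{ and } j = N \end{cases}$$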
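The occurrence matrix can be built by clamping each transaction's input and output counts at N, so that the last row and column collect the extreme chainlets. A minimal sketch in plain Python, with 1-based indices i and j mapped to 0-based list positions; the function name and the sample shapes are illustrative.

```python
def occurrence_matrix(shapes, n: int):
    """Count chainlets C_{x->y} into an n-by-n matrix.

    shapes is an iterable of (x, y) pairs with x inputs and y outputs.
    Counts with x >= n or y >= n are folded into the last row or column.
    """
    matrix = [[0] * n for _ in range(n)]
    for x, y in shapes:
        i, j = min(x, n), min(y, n)
        matrix[i - 1][j - 1] += 1
    return matrix


# Hypothetical snapshot with two extreme chainlets, (5, 1) and (7, 9).
shapes = [(1, 1), (1, 1), (1, 2), (1, 3), (5, 1), (7, 9)]
for row in occurrence_matrix(shapes, 3):
    print(row)
```

Choosing N trades matrix size against how much detail about large transactions is kept.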
  55. 55. 56 Extreme Chainlets Bitcoin companies stopped all business in New York State because of new regulations. The New York Business Journal called this the "Great Bitcoin Exodus". Percentage of extreme chainlets in the Bitcoin Graph (N = 20, daily snapshots)
  56. 56. 57 Clustering the Chainlets • A hierarchical clustering of chainlets by using Cosine Similarity over chainlet signatures in time. • We used a similarity cut threshold of 0.7 to create clusters from the hierarchical dendrogram. Chainlet clusters for daily snapshots Chainlet clusters for weekly snapshots Most common chainlets Extreme and correlated chainlets
  57. 57. 58 2b - Account based blockchains
  58. 58. 59 Account based blockchains • On account based blockchains, transactions involve one input address and one output address. • An address spends coins from a balance, keeps the change. • Each transaction of an address has an order (called nonce). The nonce is the number of transactions sent to the network by the address. • A later transaction needs to wait for earlier transactions to be mined. 4E
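A toy account ledger makes the balance and nonce rules concrete. This sketch ignores fees and gas, treats the nonce as a simple per-address counter starting at 0, and uses hypothetical names (`AccountLedger`, `apply`) and balances.

```python
class AccountLedger:
    """Toy account based ledger: balances plus a per-address nonce."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.nonces = {address: 0 for address in self.balances}

    def apply(self, sender, receiver, amount, nonce):
        """Apply a transfer if it carries the sender's next nonce."""
        expected = self.nonces.get(sender, 0)
        if nonce != expected:
            raise ValueError(f"expected nonce {expected}, got {nonce}")
        if self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[sender] -= amount          # sender keeps the change
        self.balances[receiver] = self.balances.get(receiver, 0) + amount
        self.nonces[sender] = expected + 1


ledger = AccountLedger({"a": 10, "b": 0})
ledger.apply("a", "b", 4, nonce=0)
ledger.apply("a", "b", 1, nonce=1)
print(ledger.balances, ledger.nonces)
```

A transaction with nonce 2 could not be applied before the one with nonce 1, which is the waiting behavior described in the last bullet.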
  59. 59. 60 Internal transactions • Account based blockchains have two types of “transactions”. • The first transaction type involves a transfer of the used cryptocurrency, such as Ether on Ethereum. • The second type are internal transactions, which involve a transfer of smart contract based tokens. • Internal transactions are created when smart contracts change states of addresses. • Internal transactions can be discovered in two ways: by parsing ordinary transactions’ messages, or by running the transaction message through the smart contract code. • The parsing method cannot discover failed transactions.
  60. 60. Internal transactions • A transaction can transfer both currencies and tokens. [Figure: a transaction message on Ethereum between an ordinary address and a contract address, with Ether edges of 4E and a token edge of 2.] Contracts can start events as well; these are explicitly recorded. An internal transaction can create multiple edges, although this is rare on Ethereum.
  61. 61. Trading tokens: a timeline. [Figure: a timeline $t_0$, $t_1$, $t_2$ with addresses a and b and the Storj token contract. Transaction labels: “0.3 Ether: I want to buy 2 Storj tokens”; “0 Ether: Send my 2 tokens to address a”; “0.2 Ether”. Token balances change from “b: 2 Storj” to “b: 0 Storj, a: 2 Storj”.] a pays 0.2E to b to buy its tokens. From $t_1$ to $t_2$, the Storj price decreased in the market from 0.15E to 0.1E. All edges on the Ethereum graph: a token edge of 2 and Ether edges of 0.3E and 0.2E.
  62. 62. Account based blockchains. [Figure: the largest connected component in the Storj network on 13-1-2018.] • We model account based blockchains as directed, weighted multigraphs. • The network of a single token is usually sparse, and devoid of community structure. • Daily networks may contain many disconnected components.
  63. 63. 64 Inter-token networks • Account based blockchains are global market places where goods are exchanged in terms of tokens. Ethereum is a successful example. • Blockchain platforms will allow us to view global market activity in real time.
  64. 64. 65 Research questions • Account based blockchains lend themselves to traditional network analysis tools and algorithms. • Motif analysis, core decomposition, centrality and clustering algorithms can easily be adapted to work on account based blockchains. • High granularity temporal data allows time series analyses. • The rich variety of cryptotokens being traded on the network brings many interesting research problems: Token price prediction, price manipulation detection, token network health and robustness analysis, inter-token impact analysis, investor behavior analysis.
  65. 65. 3- Blockchain Privacy and Security. Permissionless (public) blockchains: Bitcoin, Litecoin, Ethereum. Permissioned (private) blockchains: Hyperledger, R3. • By definition, any user can join a public blockchain (e.g., Bitcoin). • In corporate settings, this transparency means that rivals can learn company finances and buy/sell relationships. • Permissioned blockchains were created for industrial settings. • Permissioned: less power consumption, more secure, privacy aware, but for all purposes a gated community.
  66. 66. 3- Blockchain Privacy and Security • In public blockchains, data is considered public. • Tapscott: There are no honeypots of personal data on the blockchain. • Public blockchains are pseudo-anonymous: There is no registration to join the network, but all your transactions are public. • For security, Tor can be used to send transactions to the P2P network. • As a threat, most online exchanges are governed by know-your-customer rules that require customer registration.
  67. 67. Blockchain communication graph • At its core, Bitcoin maintains a peer-to-peer architecture. Bitcoin peers create persistent TCP channels with each other and relay transactions. • Each peer seeks a minimum of 8, a maximum of 125 peers. • Each node forwards transactions arriving from a neighbor to other neighbors. • Transactions that await mining in the P2P network are contained in the mempool. • The first sender of a transaction is most likely to be the transaction owner.
  68. 68. Blockchain communication graph • Nodes forward incoming transactions selectively to hinder time based address inference. This is called trickling. • In this network, b is connected to all neighbors of a. By observing relayed transactions, b can deduce that transaction t3 originated from a. [Figure: nodes a and b and their neighbors, each labeled with the transactions it relays, e.g. (t1, t3, t5), (t1, t6, t4), (t6, t7, t8).] Andrew Miller, James Litton, Andrew Pachulski, Neal Gupta, Dave Levin, Neil Spring, and Bobby Bhattacharjee. 2015. Discovering bitcoin’s public topology and influential nodes. (2015).
  69. 69. Blockchain TXO content graph • Can we tell which addresses are controlled by the same user/entity/organization? • In order to answer this question, we first need to map inputs to outputs. Where do the bitcoins at address a come from? a From nine addresses! Fungibility: Is a specific bitcoin worth a bitcoin everywhere? Taint analysis studies a bitcoin’s history.
  70. 70. Blockchain TXO content graph. Heuristics are used to detect which input and output addresses are controlled by the same user. Considering amounts may help in basic cases. Schemes exist that use multiple rounds of flows with equal amounts to hide tracks. [Figure: one transaction with distinct amounts (1B, 4B, 3B, 2B) and one with equal 1B amounts.] Meiklejohn, Sarah, Marjori Pomarole, Grant Jordan, Kirill Levchenko, Damon McCoy, Geoffrey M. Voelker, and Stefan Savage. A fistful of bitcoins: characterizing payments among men with no names. In Proceedings of the 2013 Internet Measurement Conference, pp. 127-140. ACM, 2013.
  71. 71. Heuristics to link addresses. 1- Idioms of Use: posits that all input addresses in a transaction should belong to the same entity, because only the owner could have signed the inputs with the associated private keys. [Figure: a transaction with input addresses a, b and c.] Addresses a, b and c belong to the same user.
  72. 72. Heuristics to link addresses. 2- Transitive Closure: extends Idioms of Use. If one transaction has inputs from a and b, and another transaction has inputs from a and c, then b and c belong to the same user. [Figure: transactions whose input sets overlap across addresses a, b, c, d and e.] Addresses a, b, c, d and e belong to the same user.
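Idioms of Use together with Transitive Closure amounts to computing connected components over co-spent input addresses. One common way to do this is a union-find structure; the sketch below is an illustration of that idea, not the cited authors' implementation, and the function name and sample transactions are hypothetical.

```python
def cluster_addresses(transactions):
    """Group addresses that appear together as inputs, transitively.

    transactions is an iterable of input-address lists, one per transaction.
    Returns a sorted list of sorted address clusters.
    """
    parent = {}

    def find(address):
        parent.setdefault(address, address)
        while parent[address] != address:
            parent[address] = parent[parent[address]]  # path halving
            address = parent[address]
        return address

    def union(first, second):
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    for inputs in transactions:
        inputs = list(inputs)
        if not inputs:
            continue
        find(inputs[0])                 # register single-input transactions
        for other in inputs[1:]:
            union(inputs[0], other)     # Idioms of Use: same owner

    clusters = {}
    for address in list(parent):
        clusters.setdefault(find(address), set()).add(address)
    return sorted(sorted(cluster) for cluster in clusters.values())


# a-b and a-c share a; c-d-e shares c, so all five addresses are linked.
print(cluster_addresses([["a", "b"], ["a", "c"], ["c", "d", "e"], ["f"]]))
```

Any error in a single merge propagates through the closure, which is one reason these heuristics are treated as error prone.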
  73. 73. Heuristics to link addresses. 3- Change address: the following four conditions must be met: (1) the output address has not appeared in any previous transaction; (2) the transaction is not a coin generation; (3) there is no self-change address in the outputs; (4) all the other output addresses in the transaction have appeared in previous transactions. The heuristic then posits that the one-time change (output) address, if one exists, is controlled by the same user as the input addresses.
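The four conditions can be read as: the transaction is not a coin generation, no output address is also an input address, and exactly one output address is new. A minimal sketch under that reading, where "self-change" is interpreted as an output address that also appears among the inputs; the function name and parameters are hypothetical.

```python
def find_change_address(inputs, outputs, seen_addresses,
                        is_coin_generation=False):
    """Return the one-time change address of a transaction, or None.

    seen_addresses holds every address that appeared in earlier transactions.
    """
    if is_coin_generation:                      # condition (2)
        return None
    if set(inputs) & set(outputs):              # condition (3): self-change
        return None
    fresh = [address for address in outputs
             if address not in seen_addresses]  # conditions (1) and (4)
    return fresh[0] if len(fresh) == 1 else None


seen = {"a", "d"}
print(find_change_address(["a"], ["d", "e"], seen))   # e is the only new output
```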
  74. 74. Obfuscation efforts • A measure to prevent matching addresses to users is known as Coin Mixing, or its improved version, CoinJoin. • The initial idea in mixing was to use a central server to mix inputs from multiple users. 2B 2B 2B 2B 5B 2B 3B 4B 2B 1B 1B 2B 3B 2B 2B 3B Ruffing, Tim, Pedro Moreno-Sanchez, and Aniket Kate. CoinShuffle: Practical decentralized coin mixing for Bitcoin. In European Symposium on Research in Computer Security, pp. 345-364. Springer, Cham, 2014.
  75. 75. Obfuscation efforts – peeling chains • In a peeling chain, a single address begins with a relatively large amount of bitcoins. • A smaller amount is then “peeled” off this larger amount, creating a transaction in which a small amount is sent to one address and the remainder is sent to a one-time change address. • This process is repeated— potentially for hundreds or thousands of hops— until the larger amount is pared down Di Battista, Giuseppe, Valentino Di Donato, Maurizio Patrignani, Maurizio Pizzonia, Vincenzo Roselli, and Roberto Tamassia. Bitconeview: visualization of flows in the bitcoin transaction graph. In Visualization for Cyber Security (VizSec), 2015 IEEE Symposium on, pp. 1-8. IEEE, 2015. Narayanan, Arvind, and Malte Möser. Obfuscation in bitcoin: Techniques and politics. arXiv preprint arXiv:1706.05432 (2017).
  76. 76. Obfuscation efforts – peeling chains 25B 0.5B 0.5B 0.5B 0.5B … Repeated patterns are frequently found on the Bitcoin blockchain. Exit to fiat currency McGinn, Dan, David Birch, David Akroyd, Miguel Molina- Solana, Yike Guo, and William J. Knottenbelt. Visualizing dynamic bitcoin transaction patterns. Big data 4, no. 2 (2016): 109-119.
  77. 77. Network clustering of addresses • By nature, all user clustering heuristics are error prone. • Some community practices further complicate the issue. • For example, online wallets, such as coinbase.com, buy/sell coins among their customers without using transactions; ownership of an address is changed by transferring the associated private keys to another user. • Although the user associated with the address changes, nothing gets recorded in the blockchain. • Clustering can be further improved by considering IP locations and temporal patterns.
  78. 78. Research directions in taint analysis • Money laundering: M. Moser, R. Bohme, and D. Breuker. An inquiry into money laundering tools in the Bitcoin ecosystem. In: eCRS. IEEE. 2013, pp. 1-14. • Ransomware payments: Huang, Danny Yuxing, Maxwell Matthaios Aliapoulios, Vector Guo Li, Luca Invernizzi, Elie Bursztein, Kylie McRoberts, Jonathan Levin, Kirill Levchenko, Alex C. Snoeren, and Damon McCoy. Tracking ransomware end-to-end. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 618-631. IEEE, 2018. • Illicit trade/use: Portnoff, Rebecca S., Danny Yuxing Huang, Periwinkle Doerfler, Sadia Afroz, and Damon McCoy. Backpage and bitcoin: Uncovering human traffickers. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1595-1604. ACM, 2017. • Personal blackmail: S. Phetsouvanh, F. Oggier and A. Datta. EGRET: Extortion Graph Exploration Techniques in the Bitcoin Network. IEEE ICDM Workshop on Data Mining in Networks (DaMNet). IEEE, 2018.
  79. 79. Thanks for attending! Cuneyt.Akcora@utdallas.edu Further reading -> Blockchain: A graph primer. C. G. Akcora, Y. R. Gel, M. Kantarcioglu. [Updated regularly, online] ArXiv:1708.08749, pp 1-16, 2017. BlockchainTutorial.Github.io
  80. 80. 4- Statistical Analysis of the Cryptocurrency Price Formation. Outline: 1. Descriptive summaries 2. Price models 3. Risk models: • Value at Risk estimates • GARCH family 4. Models using local blockchain network features
  81. 81. Price and returns of cryptocurrency. Let $y_t$ be the price of a cryptocurrency. Returns measure the relative change in prices. • Simple returns: $R_t = \frac{y_t - y_{t-1}}{y_{t-1}} = \frac{y_t}{y_{t-1}} - 1$. The benefit of using returns rather than prices is normalization: all variables are measured in a comparable metric. • Log returns: $r_t = \log y_t - \log y_{t-1} = \log \frac{y_t}{y_{t-1}}$. Log returns are additive. If we assume that prices are log-normally distributed (which, in practice, may or may not be true for any given price series), then the log transformation results in approximately normal returns, which are easier to work with.
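Both return definitions, and the additivity of log returns, can be checked on a short price series. The prices below are made-up numbers for illustration, and the function names are not from the tutorial.

```python
import math


def simple_returns(prices):
    """R_t = y_t / y_{t-1} - 1 for consecutive prices."""
    return [current / previous - 1
            for previous, current in zip(prices, prices[1:])]


def log_returns(prices):
    """r_t = log(y_t) - log(y_{t-1}) for consecutive prices."""
    return [math.log(current / previous)
            for previous, current in zip(prices, prices[1:])]


prices = [100.0, 110.0, 99.0, 120.0]   # hypothetical daily prices
print(simple_returns(prices))
print(log_returns(prices))

# Additivity: the log returns sum to the log return over the whole period.
total = sum(log_returns(prices))
print(math.isclose(total, math.log(prices[-1] / prices[0])))
```

Simple returns do not add up this way; they compound multiplicatively instead.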
  82. 82. Bitcoin price and log returns The price shows an upward trend and is volatile; the volatility is also evident in the log returns.
  83. 83. Summary statistics – log returns The summary statistics are the largest for Bitcoin, followed by Dash, Litecoin, Monero, Ripple, Maidsafecoin and Dogecoin (Chu et al., 2017). The log returns for each cryptocurrency are positively skewed. Chu, Jeffrey, Stephen Chan, Saralees Nadarajah, and Joerg Osterrieder. GARCH modelling of cryptocurrencies. Journal of Risk and Financial Management 10, no. 4 (2017): 17.
  84. 84. Summary statistics – log returns cont. Log returns are more or less symmetrically distributed. Some histograms appear more peaked than others. Figure: histograms of the log returns of the exchange rates, June 2014 – May 2017. Chu, Jeffrey, Saralees Nadarajah, and Stephen Chan. Statistical analysis of the exchange rate of bitcoin. PloS one 10, no. 7 (2015).
  85. 85. Models for cryptocurrency price and volatility • Price models • Risk models: a) Value at Risk estimates via fitting parametric distributions. b) GARCH family • Models using local blockchain network features
  86. 86. Predictive models In order to avoid spurious regression, we need to test stationarity and interdependency properties (Jang and Lee, 2017). According to Engle and Granger (1987), regressions of interdependent and non-stationary time series may lead to spurious results. Jang, Huisu, and Jaewook Lee. An empirical study on modeling and prediction of bitcoin prices with bayesian neural networks based on blockchain information. IEEE Access 6 (2018): 5427-5437. Engle, Robert F., and Clive WJ Granger. Co-integration and error correction: representation, estimation, and testing. Econometrica: journal of the Econometric Society (1987): 251-276.
  87. 87. Time plots The Bitcoin price has a positive time trend and is clearly non-stationary. Every log-return plot shows periods of very high volatility and periods of relative tranquility, which is a common feature among financial assets. Dyhrberg, Anne Haubo. Bitcoin, gold and the dollar–A GARCH volatility analysis. Finance Research Letters 16 (2016): 85-92.
  88. 88. Transaction activity and price Price, number of transactions and number of unique addresses exhibit a similar upward pattern. Their log returns show that they are volatile. Koutmos, Dimitrios. Bitcoin returns and transaction activity. Economics Letters 167 (2018): 81-85.
  89. 89. Stationarity and cointegration tests • Test for stationarity: the augmented Dickey-Fuller (ADF) test: $\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t$, where $\alpha$ is a constant, $\beta$ the coefficient on a time trend and $p$ the lag order of the autoregressive process. $H_0: \gamma = 0$ against $H_A: \gamma < 0$. Test statistic: $DF = \hat{\gamma} / se(\hat{\gamma})$. • Cointegration test: two time series are considered to be cointegrated if there exists a long-run equilibrium relationship between them. Engle-Granger cointegration test: if $x_t$ and $y_t$ are non-stationary and cointegrated, then a linear combination of them must be stationary. In other words: $y_t - \beta x_t = u_t$, where $u_t$ is stationary.
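As a rough illustration of both tests, the sketch below computes a simplified Dickey-Fuller statistic and applies it to Engle-Granger residuals. It is deliberately reduced: the regression has a constant but no time trend and no lagged differences (so it is the plain, not the augmented, test), the residual regression includes an intercept, and all series are simulated. The helper names are hypothetical. The statistic does not follow a standard t distribution; for the constant-only case the large-sample 5% critical value is roughly -2.86, and residual-based cointegration tests need their own, more negative, critical values.

```python
import math
import random

def dickey_fuller_stat(y):
    """t-statistic of gamma in  dy_t = alpha + gamma * y_{t-1} + e_t  (no trend, no lags)."""
    x = y[:-1]
    dy = [b - a for a, b in zip(y, y[1:])]
    n = len(x)
    mx = sum(x) / n
    mdy = sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (di - mdy) for xi, di in zip(x, dy))
    gamma = sxy / sxx
    alpha = mdy - gamma * mx
    rss = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, dy))
    se_gamma = math.sqrt(rss / (n - 2) / sxx)
    return gamma / se_gamma

def engle_granger_stat(y, x):
    """Regress y on x (with intercept), then run the Dickey-Fuller regression on the residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    resid = [b - my - beta * (a - mx) for a, b in zip(x, y)]
    return dickey_fuller_stat(resid)

# Simulated data: white noise (stationary) and its cumulative sum (a random walk).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(500)]
walk, level = [], 0.0
for e in noise:
    level += e
    walk.append(level)

# A series cointegrated with the walk by construction: y = 2 * walk + stationary error.
y = [2.0 * w + random.gauss(0.0, 1.0) for w in walk]

df_noise = dickey_fuller_stat(noise)  # strongly negative: unit root rejected
df_walk = dickey_fuller_stat(walk)    # much closer to zero
eg_stat = engle_granger_stat(y, walk) # strongly negative: residuals look stationary
print(df_noise, df_walk, eg_stat)
```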
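  90. 90. Granger Causality Test The causality test assesses whether one time series is useful in predicting another. Let $F_{t+h}(\cdot \mid \mathcal{F})$ denote the conditional distribution of $Y_{t+h}$ given the information set $\mathcal{F}$. • If $F_{t+h}(\cdot \mid \mathcal{F}^{(Y, X, Z_1, \dots, Z_k)}_{t-1}) = F_{t+h}(\cdot \mid \mathcal{F}^{(Y, Z_1, \dots, Z_k)}_{t-1})$, then $X_{t-1}$ is said not to Granger cause (G-cause) $Y_{t+h}$ with respect to $\mathcal{F}^{(Y, Z_1, \dots, Z_k)}_{t-1}$. • Otherwise, $X$ is said to G-cause $Y$, which can be denoted by $G_{X \to Y}$. • $\to$ represents the direction of causality.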
  91. 91. Granger Causality Test cont. For the univariate case, consider time series $y_t$, $x_t$ and $z_t$. To test G-causality of $x_t$, we compare the fit of the full model $y_t = \alpha_0 + \sum_{k=1}^{d} \alpha_k y_{t-k} + \sum_{k=1}^{d} \beta_k x_{t-k} + \sum_{k=1}^{d} \gamma_k z_{t-k} + e_t$ versus the fit of the reduced model, which omits the lags of $x$: $y_t = \alpha_0 + \sum_{k=1}^{d} \alpha_k y_{t-k} + \sum_{k=1}^{d} \gamma_k z_{t-k} + \tilde{e}_t$. • Under the null hypothesis of no predictive effect of $x$ on $y$ (i.e., $x$ does not G-cause $y$), $Var(e_t) = Var(\tilde{e}_t)$. • If $Var(e_t)$ is (statistically) significantly lower than $Var(\tilde{e}_t)$, then $x$ contains additional information that can improve forecasting of $y$, i.e., $G_{X \to Y}$.
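The full-versus-reduced comparison can be sketched with an F-statistic built from the two residual sums of squares. This is a minimal illustration under stated assumptions: ordinary least squares solved through the normal equations, simulated series in which $x$ drives $y$ by construction, and hypothetical function names `ols_rss` and `granger_f`. Under the null hypothesis with well-behaved errors, the statistic is compared against an F distribution with $(d, n - 3d - 1)$ degrees of freedom.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit, solved via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = []
    for i in range(k):
        row = [sum(r[i] * r[j] for r in rows) for j in range(k)]
        row.append(sum(r[i] * t for r, t in zip(rows, targets)))
        a.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(a[i][c]))
        a[c], a[p] = a[p], a[c]
        for i in range(k):
            if i != c:
                f = a[i][c] / a[c][c]
                a[i] = [u - f * v for u, v in zip(a[i], a[c])]
    coef = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(coef, r))) ** 2 for r, t in zip(rows, targets))

def granger_f(y, cause, control, d=1):
    """F-statistic for 'cause G-causes y', controlling for lags of y and of control."""
    full, reduced, targets = [], [], []
    for t in range(d, len(y)):
        y_lags = [y[t - k] for k in range(1, d + 1)]
        c_lags = [cause[t - k] for k in range(1, d + 1)]
        z_lags = [control[t - k] for k in range(1, d + 1)]
        full.append([1.0] + y_lags + c_lags + z_lags)
        reduced.append([1.0] + y_lags + z_lags)   # same model without the cause lags
        targets.append(y[t])
    rss_full = ols_rss(full, targets)
    rss_reduced = ols_rss(reduced, targets)
    n = len(targets)
    return ((rss_reduced - rss_full) / d) / (rss_full / (n - 3 * d - 1))

# Simulated example: y depends on its own lag and on lagged x, but not on z.
random.seed(1)
n = 300
x = [random.gauss(0.0, 1.0) for _ in range(n)]
z = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0.0, 1.0))

f_x = granger_f(y, x, z)  # large: lags of x reduce the residual variance
f_z = granger_f(y, z, x)  # small: lags of z add little
print(f_x, f_z)
```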
  92. 92. Granger Causality Test - example In both cases we can conclude that Bitcoin price realized volatility Granger-causes the VIX and the VIX Granger-causes Bitcoin price realized volatility. Estrada, Julio Cesar Soldevilla. Analyzing Bitcoin Price Volatility. University of California, Berkeley (2017).
  93. 93. Models • In the stationary case, the standard OLS estimator can be used to estimate the model. • For non-stationary and non-cointegrated series, we estimate a multivariate vector autoregressive (VAR) model. • When the time series are cointegrated, the Vector Error Correction (VEC) model is suitable for estimation. • A combination of the above models is also used.
  94. 94. Machine learning methods • Neural network (NN) and its different versions, e.g., RNN, BNN, CNN, etc. • Random Forest (RF) • Support Vector Regression (SVR)
  95. 95. Model evaluation Root mean squared error: $RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$. Mean absolute percentage error: $MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|$, where $y_t$ is the Bitcoin price and $\hat{y}_t$ is the corresponding predicted value.
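Both error measures are short enough to write out directly. The sketch below assumes the standard definitions (square root in RMSE, absolute value in MAPE, MAPE reported as a fraction rather than a percentage); the price and forecast values are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (multiply by 100 for percent)."""
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Hypothetical prices and forecasts.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(rmse(actual, predicted))  # sqrt(200 / 3), roughly 8.16
print(mape(actual, predicted))  # (0.10 + 0.05 + 0.00) / 3, roughly 0.05
```

RMSE is expressed in price units and penalizes large misses more heavily, while MAPE is scale-free, which matters when the price level changes by orders of magnitude over the sample.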
  96. 96. Bayesian neural networks (BNN) • BNN models outperform other models in terms of RMSE and MAPE when predicting the log price of Bitcoin one day ahead. • BNN is more reliable for describing the process of log volatility than other benchmark models (Jang and Lee, 2017).
  97. 97. Bayesian neural networks (BNN) We observe that Bayesian neural networks capture the patterns of the Bitcoin prices better than other models (Jang and Lee, 2017).
  98. 98. Risk Models: Tests for randomness and no serial correlation The p-values for the randomness tests and serial correlation tests are all greater than 0.05. Therefore the null hypotheses of randomness and of no serial correlation cannot be rejected at the 5% level for Bitcoin log returns and squared log returns (Chu et al., 2015).
  99. 99. Model selection techniques Akaike's information criterion (AIC) and the Bayesian information criterion (BIC): $AIC = -2\ell + 2k$, $BIC = -2\ell + k \ln n$, where $\ell$ is the maximized log-likelihood of the model, $k$ is the number of parameters in the model and $n$ is the number of observations. • The model with the smallest AIC and BIC is considered the “best" model. Graphical approach: • Quantile-Quantile (Q-Q) plot. • Observed versus fitted density. These are popular techniques for model diagnostics.
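To make the criteria concrete, the sketch below fits two simple two-parameter distributions (normal and Laplace) to a small sample by maximum likelihood and compares their AIC and BIC. The sample values are hypothetical, and these two distributions are stand-ins chosen because their maximized log-likelihoods have closed forms; they are not the candidate set used in the cited studies.

```python
import math
import statistics

def aic(loglik, k):
    """AIC = -2 * loglik + 2 * k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2 * loglik + k * ln(n)."""
    return -2.0 * loglik + k * math.log(n)

def normal_loglik(data):
    """Maximized normal log-likelihood (MLE: sample mean, variance with divisor n)."""
    n = len(data)
    mu = statistics.fmean(data)
    var = sum((v - mu) ** 2 for v in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def laplace_loglik(data):
    """Maximized Laplace log-likelihood (MLE: median, mean absolute deviation from it)."""
    n = len(data)
    med = statistics.median(data)
    b = sum(abs(v - med) for v in data) / n
    return -n * (math.log(2.0 * b) + 1.0)

# Hypothetical log returns with a few large moves.
returns = [0.010, -0.020, 0.001, 0.150, -0.010, 0.005, -0.120, 0.020, 0.002, -0.005]
n = len(returns)

for name, loglik in [("normal", normal_loglik(returns)), ("laplace", laplace_loglik(returns))]:
    print(name, aic(loglik, 2), bic(loglik, 2, n))
```

Because both candidates here have the same number of parameters, the ranking is driven by the log-likelihood alone; the penalty terms only matter when models of different size are compared.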
  100. 100. Model selection for log returns Overall, the generalized hyperbolic distribution gives the best fit, having the smallest values of −ln L, AIC, AICc and BIC (Chu et al., 2015).
  101. 101. Model selection for Bitcoin log returns cont. The Q-Q plot, probability plot and density plot of the fitted generalized hyperbolic distribution suggest that the fit is good. The fit also appears reasonable in the tails (Chu et al., 2015).
  102. 102. Bitcoin Value at Risk (VaR) Value at Risk is the maximum loss that should not be exceeded during a specified period of time with a given probability level. Let $X$ denote the return over that period and $f(x)$ its probability density function. Then $P(X \le -VaR_{1-\alpha}) = \alpha$, that is, $\int_{-\infty}^{-VaR_{1-\alpha}} f(x)\,dx = \alpha$. The fitted values for the VaR appear very close to the historical estimates (Chu et al., 2015).
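The historical estimate mentioned on the slide can be sketched as an empirical quantile of past returns. This is an assumption-laden illustration: it uses a simple order-statistic rule for the quantile (other interpolation conventions exist), a synthetic return series, and a hypothetical function name. The parametric approach would instead invert the fitted distribution's CDF at $\alpha$.

```python
import math

def historical_var(returns, alpha=0.05):
    """Empirical VaR at level 1 - alpha: minus the alpha-quantile of the returns."""
    ordered = sorted(returns)
    n = len(ordered)
    # Smallest order statistic whose empirical CDF reaches alpha
    # (the small tolerance guards against floating-point noise in alpha * n).
    k = max(0, math.ceil(alpha * n - 1e-9) - 1)
    return -ordered[k]

# Synthetic returns from -0.50 to 0.49 in steps of 0.01, for illustration only.
returns = [i / 100.0 for i in range(-50, 50)]

var_95 = historical_var(returns, alpha=0.05)
share_below = sum(1 for x in returns if x <= -var_95) / len(returns)
print(var_95, share_below)  # the share of returns at or below -VaR matches alpha
```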
  103. 103. GARCH Modelling The GARCH($p$, $q$) model for Bitcoin price returns, $r_t$, is defined as $r_t = \sigma_t \epsilon_t$, $\sigma_t^2 = w_0 + \sum_{i=1}^{q} w_i r_{t-i}^2 + \sum_{j=1}^{p} \tau_j \sigma_{t-j}^2$, where $w_0 > 0$, $w_i > 0$, $\tau_j > 0$, $\epsilon_t \sim \mathrm{IID}(0, 1)$, $i = 1, 2, \dots, q$, $j = 1, 2, \dots, p$. To assess how explanatory variables influence the volatility of the Bitcoin price we can employ a GARCH-X model: $\sigma_t^2 = w_0 + \sum_{i=1}^{q} w_i r_{t-i}^2 + \sum_{j=1}^{p} \tau_j \sigma_{t-j}^2 + \Lambda X_t$.
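The variance recursion is easiest to see for GARCH(1,1). The sketch below filters a return series through the recursion for given parameters; it does not estimate them. The starting value (the sample mean of squared returns), the single optional covariate standing in for $\Lambda X_t$, the parameter values and the function name are all assumptions made for illustration.

```python
def garch11_variance(returns, w0, w1, tau1, x=None, lam=0.0):
    """Conditional variances from sigma2_t = w0 + w1 * r_{t-1}^2 + tau1 * sigma2_{t-1} (+ lam * x_t)."""
    n = len(returns)
    # Assumption: initialise the recursion at the sample mean of squared returns.
    sigma2 = [sum(r * r for r in returns) / n]
    for t in range(1, n):
        s = w0 + w1 * returns[t - 1] ** 2 + tau1 * sigma2[t - 1]
        if x is not None:
            s += lam * x[t]  # GARCH-X term with a single explanatory variable
        sigma2.append(s)
    return sigma2

# Hypothetical returns and parameters.
returns = [0.0, 0.1, -0.2]
sigma2 = garch11_variance(returns, w0=0.01, w1=0.1, tau1=0.8)
sigma2_x = garch11_variance(returns, w0=0.01, w1=0.1, tau1=0.8, x=[0.0, 1.0, 1.0], lam=0.005)
print(sigma2)
print(sigma2_x)
```

In practice the parameters are estimated by maximum likelihood, typically with a dedicated package, and the recursion above is what the likelihood evaluates at each candidate parameter vector.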
  104. 104. GARCH family • A variety of GARCH models: SGARCH, EGARCH, GJRGARCH, APARCH, IGARCH, CSGARCH, GARCH, TGARCH, etc. • We select the best model based on different model selection criteria, e.g., AIC, BIC, etc.
  105. 105. GARCH model for Bitcoin The second model is the exponential GARCH model, which investigates whether the return on Bitcoin is asymmetrically affected by good and bad news (known as the leverage effect) (Dyhrberg, 2016).
  106. 106. GARCH model for Bitcoin cont. Mean equation: • The exchange rates suggest that Bitcoin returns are more sensitive to the value of the dollar relative to the pound sterling than to the value of the dollar relative to the euro. • Therefore regional or country-specific effects are present.
  107. 107. GARCH model for Bitcoin cont. Variance equation: • Bitcoin returns will have a lower volatility than the dollar when there is a positive volatility shock to the federal funds rate. • A positive volatility shock to the dollar-sterling exchange rate decreases the variance of the Bitcoin returns. • This may indicate that Bitcoin is a relatively safe asset in such a situation (Dyhrberg, 2016).
  108. 108. Bitcoin volatility - machine learning methods Modeling realized volatility. Features: (shown in a figure on the original slide). Methods considered: EWMA, ARIMA, ARIMAX, RF, GBT, XGT, etc. Guo, Tian, and Nino Antulov-Fantulin. Predicting short-term bitcoin price fluctuations from buy and sell orders. arXiv preprint arXiv:1802.04065 (2018).
  109. 109. Bitcoin volatility - machine learning methods cont. ● The simple EWMA can beat all others in some intervals. ● Simply adding features from the order book does not necessarily improve the performance. ● Models like ARIMAX and STRX are prone to overfitting on redundant data over long horizons, while the ensemble method XGT and ENET are relatively robust to the horizon.
  110. 110. Models with local blockchain network features Local higher-order structures of complex networks, or multiple-node subgraphs, are found to be an indispensable tool for analysis of 1. robustness of biological networks (Milo et al., 2002) 2. robustness of power grids (Dey et al., 2017) 3. functionality and early warning stability indicators in financial networks (Jiang et al., 2014) Milo, Ron, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science 298, no. 5594 (2002): 824-827. Dey, Asim Kumer, Yulia R. Gel, and H. Vincent Poor. Motif-based analysis of power grid robustness under attacks. In Signal and Information Processing (GlobalSIP), 2017 IEEE Global Conference on, pp. 1015-1019. IEEE, 2017. Jiang, X. F., T. T. Chen, and B. Zheng. Structure of local interactions in complex financial dynamics. Scientific Reports 4 (2014): 5321.
  111. 111. 112 Models with Local blockchain network features Local higher-order structures of complex networks, or multiple-node subgraphs: Three distinct types of 1-chainlets!
  112. 112. Blockchain network features In contrast to fiat currencies, transactions of cryptocurrencies are permanently recorded on distributed ledgers to be seen by the public. A natural analytics approach is then to ask the following three interlinked questions (Akcora et al., 2018): 1. Do changes in chainlet characteristics exhibit any causal effect on future cryptocurrency price and returns? 2. Do chainlets convey some unique information about future cryptocurrency prices, given more conventional economic variables and non-network blockchain characteristics? 3. Do the chainlet dynamics of one cryptocurrency influence the price and volatility of other cryptocurrencies? Cuneyt G. Akcora, Asim Kumer Dey, Yulia R. Gel, and Murat Kantarcioglu. Forecasting Bitcoin Price with Graph Chainlets. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 765-776. Springer, Cham, 2018. Asim Kumer Dey, Cuneyt G. Akcora, Yulia R. Gel, and Murat Kantarcioglu. On the Role of Local Blockchain Network Features in Cryptocurrency Price Formation. 2019.
  113. 113. 114 Chainlet Predictive Utility in Price (Akcora et al., 2018)
  114. 114. Cryptocurrency Price Prediction with Chainlets (Dey et al., 2019) The comparison study is based only on Random Forest (RF) type models.
  115. 115. Model comparison The predictive utility of a model $B_i$ over the baseline model $B_0$ can be measured as $\Psi_{X \to Y} = \frac{\psi(B_i)}{\psi(B_0)}$. • $\psi$ is a measure of prediction error, e.g., root mean squared error (RMSE). • If $\Psi_{X \to Y} < 1$, the covariate $X$ is said to improve prediction of $Y$. The percentage change in $\psi$ for a specific model w.r.t. $B_0$ is $\Delta = (1 - \Psi_{X \to Y}) \times 100\%$ (Akcora et al., 2018).
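A small numerical sketch of the two quantities, assuming RMSE as the error measure $\psi$. The RMSE values and the function names are hypothetical and only show how the ratio and the percentage gain relate.

```python
def predictive_utility(error_model, error_baseline):
    """Psi = psi(B_i) / psi(B_0); values below 1 mean the model beats the baseline."""
    return error_model / error_baseline

def percent_gain(psi_ratio):
    """Delta = (1 - Psi) * 100, the percentage reduction in prediction error."""
    return (1.0 - psi_ratio) * 100.0

# Hypothetical RMSE values for a baseline model B0 and a model Bi with an extra covariate.
rmse_b0 = 250.0
rmse_bi = 200.0

psi_ratio = predictive_utility(rmse_bi, rmse_b0)
delta = percent_gain(psi_ratio)
print(psi_ratio, delta)  # ratio 0.8, i.e. a reduction of about 20 percent
```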
  116. 116. Cryptocurrency Price Prediction with Chainlets • For short to moderate term (up to 15 days ahead) forecasting horizons, model B2, solely based on Bitcoin occurrences, yields more accurate performance, although closely followed by models B3 and B4 (Akcora et al., 2018). • For longer term forecasting horizons, i.e., more than 15 days ahead, model B4, containing information from both Bitcoin and Litecoin, delivers the most competitive results, followed by model B2.
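  117. 117. Analyzing Price Volatility with Chainlets To assess how chainlet variables influence the volatility of the Bitcoin price we employ a GARCH-X model with the explanatory variables: $\sigma_t^2 = w_0 + \sum_{i=1}^{q} w_i r_{t-i}^2 + \sum_{j=1}^{p} \tau_j \sigma_{t-j}^2 + \Lambda X_t$, where $X = [\mathbb{O}^{₿}_{\mathbb{C}_{1 \to 7}},\ \mathbb{O}^{₿}_{\mathbb{C}_{20 \to 3}},\ \mathbb{O}^{₿}_{\mathbb{C}_{3 \to 3}},\ \mathbb{O}_{\text{Bitcoin cluster } 7},\ \mathbb{A}^{₿}_{\mathbb{C}_{3 \to 4}},\ \mathbb{A}^{₿}_{\mathbb{C}_{20 \to 20}}]$ and $\Lambda = (\lambda_1, \lambda_2, \dots, \lambda_6)'$. • All the explanatory variables are in the form of log returns. • GARCH(1,1) model. • $\epsilon_t \sim N(0, 1)$. (Dey et al., 2019)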
  118. 118. Analyzing Price Volatility with Chainlets cont. The model with chainlet covariates, Model X, tends to describe the Bitcoin price volatility more accurately than the volatility model without chainlet covariates, i.e., Model 0 (Dey et al., 2019).
  119. 119. Future Research Directions • Relationship between the transaction networks of multiple cryptocurrencies and the health of the crypto ecosystem. • Network features of cryptocurrency transactions as a proxy for market sensing. • Ensemble forecasting of fiat currencies with cryptocurrency features.
  120. 120. Thanks for attending! Yulia R. Gel ygl@utdallas.edu BlockchainTutorial.Github.io