The document discusses data, data science, and finding data sources. It defines data as raw facts about the world and notes that data comes from various sources like government, scientific research, citizens, and private companies. It then discusses the growth of digital data and issues around open data. The document defines data science as using analysis methods to describe facts, detect patterns, and test hypotheses. Finally, it provides tips on finding needed data, such as searching open data sources, APIs, scraping, and joining datasets.
2. I. What do we mean when we talk about data and where does data come from? II. What is data science? III. Where do you find the data you need? IV. Science Module: Niche Modelling
3. What do we mean when we talk about data? Session III
4. We ask many, many questions about the world around us.
5. Answering those questions accurately requires a body of data and a set of tools to perform analyses that will lead us toward an answer.
6. “The goal is to transform data into information, and information into insight.” Carly Fiorina
12. A lot of this data is available for you to use
13. Where does the data come from? Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.ncbi.nlm.nih.gov/genbank/ http://www.gbif.org/ http://earthengine.googlelabs.com Semi-automated and large scale collections http://eospso.gsfc.nasa.gov/ http://www.airquality.co.uk/autoinfo.php http://www.statistics.gov.uk/ For profit http://www.flickr.com/ http://www.google.com/trends http://www.facebook.com/data Citizens http://www.wikipedia.org/ http://protectedplanet.net/ http://www.openstreetmap.org/
15. Direct data Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.ncbi.nlm.nih.gov/genbank/ http://www.gbif.org/ http://earthengine.googlelabs.com Semi-automated and large scale collections http://eospso.gsfc.nasa.gov/ http://www.airquality.co.uk/autoinfo.php http://www.statistics.gov.uk/ For profit http://www.flickr.com/ http://www.google.com/trends http://www.facebook.com/data Citizens http://www.wikipedia.org/ http://protectedplanet.net/ http://www.openstreetmap.org/
32. '...survey a portion of our region of the Milky Way galaxy to discover dozens of Earth-size planets in or near the habitable zone and determine how many of the billions of stars in our galaxy have such planets...'
33. Launch a telescope into space for orbit around the sun. Collect data from thousands of stars. Spend 3 months trying to detect the presence of planets in orbit around these stars. RELEASE THE DATA FOR USE BY ANYONE!!!
38. Data analysis is a body of methods that help to describe facts, detect patterns, develop explanations, and test hypotheses. It is used in all of the sciences. It is used in business, in administration, and in policy. Levine, 1997, Introduction to Data Analysis: The Rules of Evidence
39. “The goal is to transform data into information, and information into insight.” Carly Fiorina
40. It is a set of skills performed often but not exclusively by scientists
41. The availability of data on the internet is making data analysis accessible to anyone
51. Open government data; large published datasets such as Wikipedia, Flickr, or public FusionTables; natural sciences resources; weather, atmosphere, and geographic datasets
52. See the growing list of datasets our class will uncover https://github.com/andrewxhill/DMID/wiki/Datasets
54. Maybe you can download it directly from the website. Pay attention to what formats you find; some will be easier for you to use than others.
55. Sometimes APIs, or application programming interfaces, are available. These are probably for people with a bit more programming experience, but if you need the data, ask the instructors and we might be able to help.
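As a rough illustration of what using an API involves, the sketch below builds a query URL and parses a JSON reply. The endpoint `https://api.example.org/records`, the `country` and `limit` parameters, and the shape of the response are all hypothetical, chosen only for illustration; a real API documents its own parameters and fields. No network request is made here, so a canned response string stands in for the reply.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/records"


def build_query_url(base_url, **params):
    """Return base_url with params encoded as a query string (sorted by key)."""
    return base_url + "?" + urlencode(sorted(params.items()))


def parse_records(response_text):
    """Turn a JSON response body into a list of (name, count) tuples."""
    payload = json.loads(response_text)
    return [(item["name"], item["count"]) for item in payload["results"]]


# A made-up response body standing in for what an API might return.
SAMPLE_RESPONSE = (
    '{"results": ['
    '{"name": "Puma concolor", "count": 12}, '
    '{"name": "Lynx rufus", "count": 7}]}'
)

url = build_query_url(BASE_URL, country="US", limit=2)
records = parse_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

In practice the URL would be fetched with something like `urllib.request.urlopen`, and the response text passed to `parse_records`.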
57. Scraping: generally the hardest method, as it means programmatically pulling data from sources that were not necessarily designed to have data pulled from them.
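To give a feel for what scraping looks like, here is a minimal sketch that pulls the rows out of an HTML table using only Python's standard library. The page content in `SAMPLE_PAGE`, including the station names and values, is invented for illustration; real pages are usually messier and often need a more forgiving parser.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Station</th><th>Temp (C)</th></tr>
  <tr><td>Alpha</td><td>-12.5</td></tr>
  <tr><td>Beta</td><td>-30.1</td></tr>
</table>
</body></html>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
header, *data_rows = scraper.rows
print(header)
print(data_rows)
```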
59. Linked data: an evolution of both scraping and APIs, in which many web resources are now designed to be both human readable and programmatically navigable.
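One common way a page can be both human readable and machine navigable is by embedding a JSON-LD block alongside its visible HTML. The sketch below assumes the page does this; the slides do not say which linked data format a given source uses, and the page content and dataset name in `SAMPLE_PAGE` are invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect every JSON-LD document embedded in <script> tags."""

    def __init__(self):
        super().__init__()
        self.documents = []   # parsed JSON-LD objects
        self._buffer = None   # text of the script block being read

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.documents.append(json.loads("".join(self._buffer)))
            self._buffer = None


# Invented page: the <h1> is for people, the script block is for programs.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example air quality readings"}
</script>
</head><body><h1>Example air quality readings</h1></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_PAGE)
dataset = extractor.documents[0]
print(dataset["@type"], "-", dataset["name"])
```

Compared with scraping, nothing here depends on the page's visual layout: the program reads a structured description that the publisher provided on purpose.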
61. Remember that all of these data sources have different formats and potential sources of error. We will have a full session on data preparation, cleaning, and analysis.
62. Remember that data comes in many shapes and sizes; be aware of what you find. General formats: png, jpg, xls, doc. Categorical and storage: csv, sql. Geographic: shp, tif, asc.
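When a download contains many files, a quick first step is to sort them by format. The sketch below groups file names by extension using the three categories from the slide; the function name `classify` and the example file names are made up for illustration.

```python
import os

# Extension groups taken from the slide.
FORMAT_GROUPS = {
    "general": {"png", "jpg", "xls", "doc"},
    "categorical and storage": {"csv", "sql"},
    "geographic": {"shp", "tif", "asc"},
}


def classify(filename):
    """Return the format group for a file name, or 'unknown'."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    for group, extensions in FORMAT_GROUPS.items():
        if ext in extensions:
            return group
    return "unknown"


# Hypothetical file names, as might appear in a downloaded archive.
for name in ["roads.SHP", "census.csv", "photo.jpg", "notes.txt"]:
    print(name, "->", classify(name))
```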
Humans are naturally curious about our world. In addition, we have social, economic, and personal motivations to understand how and why the world around us changes.
Carly Fiorina was CEO of HP.
1993: David Vaughan, British Antarctic Survey, predicted breaking in 30 years. 2008: he conceded that his estimates had been too conservative.
Automated weather station, Lake Vida, Antarctica. 19 meters of ice. 2,500 years.
GenBank
Not everyone has the means to take this data and study it in a sophisticated analysis. But a lot of people are interested in space, astronomy, and our universe. So how could Kepler ensure that these people could help them and have fun?
You will not likely encounter this at any time over the next couple of weeks, but it is best to be aware of it.
XLS might be easier for you to navigate: open it in Excel, sort columns, and search for what you want. But CSV will almost always be easier to use anywhere other than Excel: it is smaller and more compact, yet easily parsable.
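To show how little is needed to work with CSV outside a spreadsheet, the sketch below reads a small CSV with Python's standard `csv` module, then sorts by a column and searches for matching rows, the same operations mentioned for Excel. The column names and values in `SAMPLE_CSV` are invented for illustration.

```python
import csv
import io

# Invented sample data; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """station,year,temp_c
Alpha,2009,-12.5
Beta,2009,-30.1
Alpha,2010,-11.0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)

# Sort by a column: values arrive as strings, so convert before comparing.
by_temp = sorted(rows, key=lambda r: float(r["temp_c"]))

# Search for what you want: keep only the rows for one station.
alpha_only = [r for r in rows if r["station"] == "Alpha"]

print(by_temp[0]["station"])
print([r["year"] for r in alpha_only])
```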
This is not the linked data
ISO: International Organization for Standardization