Social Web: (Big) Data Mining | ISS FSV UK | Charles University in Prague | Faculty of Social Sciences | Institute of Sociological Studies | bachelor’s course | JSB454 | summer semester 2014/2015
Course Syllabus (version 1.1)
1. JAKUB RŮŽIČKA jameslittlerose@gmail.com cz.linkedin.com/in/littlerose
summer semester 2014/2015
SOCIAL WEB:
(BIG) DATA MINING
bachelor's course | ISS FSV UK | JSB454
course syllabus
[version 1.1]
4. outline
Social Web: (Big) Data Mining
The course gives a professional and academic introduction to web & social media data mining.
Emphasis is put on the intersection of data science, humanities & ICT.
guarantors
• PhDr. Mgr. Ing. Petr Soukup
• Jakub Růžička
lecturers
• Jakub Růžička
• Petr Soukup
credits
• 7 ECTS
• elective course
lectures
• 1 lecture (80 min) & 1 tutorial/seminar (80 min) per week
6. outline
Upon completion of the course, students will be able to
• understand the intersection of data science, humanities & ICT within the realm of web & social media (big) data mining
• ask meaningful questions, perform basic analytical operations on both structured & unstructured web / social media data, and draw conclusions for decision making
• understand basic concepts and carry out the subsequent data preprocessing, analysis & visualization related to social network analysis, web mining, social media mining & text mining
• take a positive approach towards data science & computer programming, gain confidence in basic operations, and use or modify third-party (open) source code or an analytical procedure/tool
• describe advanced data mining methods & applications for further self-education (or subsequent institutional education) or professional/academic specialization
8. outline
Course Overview
lectures are followed by tutorials in order to put knowledge into practice
the exact dates & content of the lectures may be subject to change based on pace & requirements of the course group
• Lecture #1: Introduction to Data Mining & Data Analysis | Data Science | Digital Humanities
• Lecture #2: Big Data | Types of Data | Data Formats | Information Retrieval | Business Intelligence | Law & Ethics of Data Mining
• Lecture #3: Introduction to Web Technologies for Non-Tech Students | Database Systems | Web Programming | Semantic Web | APIs
• Lecture #4: Graph Theory | Social Network Analysis | Statistical Procedures, Apps & Tools
• Lecture #5: Pseudocoding | Introduction to Programming in Python & Comparison of Data Mining Alternatives | Data Exploration & Preprocessing
• Lecture #6: Web Scraping | Data Cleaning & Processing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools
• Lecture #7: Social Media Mining | Data Cleaning & Processing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools
• Lecture #8: Text Mining | Natural Language Processing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools
• Lecture #9: Data Visualization | Data Storytelling | Electronic Publishing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools
• Lecture #10: Student Webinars Week | Introducing Various Free & Open Source Data Mining Software & Apps
• Lecture #11: Machine Learning, Recommender Systems & Other More Advanced Topics | Large-Scale Data Sets | MapReduce, Hadoop, NoSQL
• Lecture #12: Course Review | Semestral Projects Consultation & Adjustments | The Remaining 99% of Data Science | Data Science Buzzwords
10. outline
Types of Instruction & Workload
the course consists of
• lectures
• tutorials/seminars
• guest lectures (possibly webinars)
• student webinars
background, how-to, support & inspiration during lectures & tutorials/seminars, and online course materials for self-directed students
workload | 150 hours
• lectures 16h
• tutorials/seminars 16h
• assignments: team project 70h, webinar 20h
• self-study 28h
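The workload figures above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the dictionary keys and variable names are made up for the example, and the count of 12 weekly sessions is taken from the course overview.

```python
# Workload items as listed on the slide, in hours (key names are illustrative).
WORKLOAD_HOURS = {
    "lectures": 16,
    "tutorials/seminars": 16,
    "team project": 70,
    "webinar": 20,
    "self-study": 28,
}

# Assumption: 12 weekly lectures of 80 minutes each, per the course overview.
SESSIONS = 12
SESSION_MINUTES = 80

total_hours = sum(WORKLOAD_HOURS.values())
lecture_hours = SESSIONS * SESSION_MINUTES / 60

print(f"total workload: {total_hours} h")
print(f"lecture time:   {lecture_hours:.0f} h")
```

The individual items add up to the stated total, and twelve 80-minute lectures account for the 16 lecture hours.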
11. outline
Teaching Method & Related Information
storytelling
• the course topics will be tied together by obtaining real-time (& real-life) data for the decision making of a fictional political party
• teams of 2-3 students will be formed in response to the need to study a more specific area of the political campaign | teams will be differentiated by specific topic/area of interest rather than by types of analyses
collaboration
• teamwork & knowledge sharing will be strongly encouraged & facilitated | collaboration has its downsides as well, but since there are too many 'individual work' courses & too few 'team work' courses, let's try working together for a change
BYOD (Bring Your Own Device)
• several software packages requiring installation & personalization will be used within the course
• BYOD is therefore recommended
beginner-friendly (quite =))
• although the course might be challenging for students with no analytical or computing background (introductory-level courses or professional experience), most of the time you won't be required to write your own computer code 'from scratch' (that would require another course); instead, you'll be provided with working code (explained in pseudocode) that you'll customize
• user-level knowledge of social media is assumed
12. Requirements, Examination & Assignments
(I.) 30% Webinar | collaborative, teams of 2-3
(II.) 70% Project/Research | collaborative, teams of 2-4
* the percentage is the weight of the assignment in the final grade
13. outline
Grading
the grade is calculated from the WEBINAR (30%) and the PROJECT/RESEARCH defence (70%)
the course is graded A (>=85%), B (>=70%), C (>=60%), D (>=50%), or E (<50%)
A, B or C is needed to pass the course
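The grading rule above amounts to a weighted average followed by a threshold lookup. A minimal Python sketch, assuming each assignment is scored on a 0-100 scale (the slides do not state the scale) and using function names invented for this example:

```python
WEBINAR_WEIGHT = 30  # percent of the final grade
PROJECT_WEIGHT = 70  # percent of the final grade

# Lower bounds from the slide, checked from the highest grade down.
GRADE_BOUNDS = [(85, "A"), (70, "B"), (60, "C"), (50, "D")]
PASSING_GRADES = {"A", "B", "C"}


def final_percentage(webinar_pct, project_pct):
    """Combine the two assignment scores (assumed 0-100 each) into one percentage."""
    return (WEBINAR_WEIGHT * webinar_pct + PROJECT_WEIGHT * project_pct) / 100


def letter_grade(pct):
    """Return the first grade whose lower bound is met, or E below 50%."""
    for lower_bound, grade in GRADE_BOUNDS:
        if pct >= lower_bound:
            return grade
    return "E"


def passes(grade):
    """A, B or C is needed to pass the course."""
    return grade in PASSING_GRADES


# Hypothetical scores, for illustration only.
score = final_percentage(webinar_pct=80, project_pct=90)
grade = letter_grade(score)
print(score, grade, passes(grade))
```

Keeping the weights as whole percentages and dividing once at the end avoids small floating-point surprises right at the grade boundaries.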
14. outline
(I.) Webinar 30% | collaborative, teams of 2-3 students
assignment
• 1) familiarize yourself (briefly) with an assigned data mining tool or application (you may also choose your own if the lecturer approves it) and introduce it
• 2) replicate an analysis (cite your source) using the tool and explain the procedure & background information
• 3) prepare a short (5-15 min) live webinar for your classmates & answer their questions (questions regarding your particular analysis only)
• 4) let your classmates peer-assess your work
motivation
• the number of free & open source data science procedures, tools & applications is growing rapidly, so you definitely won't 'be done' after passing this course
• the volume of open educational resources (text, video, interactive etc.) is huge, and the tools are usually well documented & include sample analyses provided by their creators or communities
• you'll learn most through a hands-on approach, and you'll get feedback from your peers
20%
• brief description of the tool
• what it is for
• how one can use it
• where one can get it & learn it
60%
• replication of an analysis
• background information
• clarity of the procedure
20%
• question responses
• only questions related to the particular analysis count (one doesn't become an expert on a tool by replicating one analysis =))
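The 20/60/20 split above can be read as a weighted average of three criterion scores. The sketch below is one possible way to compute it, not the lecturers' actual marking procedure. The criterion labels, the 0-100 scale per criterion and the example scores are all assumptions.

```python
# Webinar criteria and their weights in percent, as listed above
# (the short labels are invented for this example).
WEBINAR_RUBRIC = {
    "tool description": 20,
    "analysis replication": 60,
    "question responses": 20,
}


def weighted_score(rubric, scores):
    """Return the weighted average (0-100) of per-criterion scores (assumed 0-100)."""
    if sum(rubric.values()) != 100:
        raise ValueError("rubric weights must add up to 100")
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in rubric.items()) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "tool description": 90,
    "analysis replication": 75,
    "question responses": 100,
}
webinar_pct = weighted_score(WEBINAR_RUBRIC, example_scores)
print(f"webinar score: {webinar_pct:.1f}%")
```

Because the replication carries 60% of the weight, a weak replication cannot be offset by a strong tool description alone.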
15. outline
(II.) Project/Research 70% | collaborative, teams of 2-4 students
assignment
• 1) mine/scrape, analyze & visualize available structured & unstructured web & social media data related to your team's area of specialization within the fictional political party's campaign planning
• 2) prepare an executive summary in the form of a storyline highlighting the most important findings for decision making
• 3) defend your project/research (examination)
motivation
• preparation for conducting commercial or academic research that includes web & social media data mining & related analyses
• an opportunity to try everything out 'under supervision' & get feedback on your work
• practicing teamwork skills, organization & division of labour within a larger work group / institution
30%
• executive summary, clarity & coherence of the data story, and meeting all requirements on the analyses used (see the next slide)
40%
• appropriateness & correctness of the mining procedures & analyses used and of your data interpretation, consideration of the limitations of your outcomes (critical context)
30%
• answers to questions regarding procedures, analyses & other 'technical' details of your project/research
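To see how much each part of the 30/40/30 split adds to the project score, the criteria can be tabulated in Python. This is a hedged sketch: the `Criterion` class, the shortened criterion names, the 0-100 scale per criterion and the example scores are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int  # percent of the project score


# Weights as listed above; the names are shortened for this example.
PROJECT_RUBRIC = (
    Criterion("executive summary & data story", 30),
    Criterion("mining procedures, analyses & interpretation", 40),
    Criterion("answers to technical questions", 30),
)


def contributions(rubric, scores):
    """Map each criterion name to the points it adds to the project score."""
    return {c.name: c.weight * scores[c.name] / 100 for c in rubric}


# Hypothetical scores (assumed 0-100 per criterion), for illustration only.
example_scores = {
    "executive summary & data story": 80,
    "mining procedures, analyses & interpretation": 70,
    "answers to technical questions": 90,
}
points = contributions(PROJECT_RUBRIC, example_scores)
project_pct = sum(points.values())

for name, value in points.items():
    print(f"{name}: {value:.1f} points")
print(f"project score: {project_pct:.1f}%")
```

Listing the contributions separately makes it easy to see which criterion costs a team the most points.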
16. outline
Discussed within the project defence & included in the project executive summary
• the story of your data (for decision making within your specialization)
• visualizations, descriptions, theoretical background, interpretations & highlights
• social network analysis
• web scraping
• social media mining
• text mining & natural language processing
• critical review of the project & limitations of the generalizability of your research
• analytical appendix with a hyperlink to source tables & datasets
• 'technical' appendix: computations, programming code, requests, queries etc.
17. Course Literature & Documentation
• you are not required to read any of the following, but you might find it handy when looking for inspiration, reference, sample analyses or sample code, or when some part of the course catches your interest and you want to follow up with more in-depth self-directed study
• further online/paperback study resources, tutorials, libraries, applications & tools will be introduced within specific topics of the course
18. outline
Books
GOLBECK, Jennifer. ANALYZING THE SOCIAL WEB. Amsterdam: Morgan Kaufmann, 2013. ISBN 01-240-5531-1.
TSVETOVAT, Maksim and Alexander KOUZNETSOV. SOCIAL NETWORK ANALYSIS FOR STARTUPS. O'Reilly, 2011. ISBN 978-144-9306-465.
HANSEN, Derek, Ben SHNEIDERMAN and Marc SMITH. ANALYZING SOCIAL MEDIA NETWORKS WITH NODEXL: INSIGHTS FROM A CONNECTED WORLD. Burlington, MA: Morgan Kaufmann, 2011. ISBN 01-238-2229-7.
MURRAY, Scott. INTERACTIVE DATA VISUALIZATION FOR THE WEB. Sebastopol, CA: O'Reilly Media, 2013. ISBN 14-493-6108-0.
STEELE, Julie and Noah ILIINSKY. BEAUTIFUL VISUALIZATION. Sebastopol, CA: O'Reilly, 2010. ISBN 14-493-7986-9.
FRY, Ben. VISUALIZING DATA. Sebastopol, CA: O'Reilly, 2007. ISBN 05-965-1455-7.
19. outline
Books
MCKINNEY, Wes. PYTHON FOR DATA ANALYSIS: DATA WRANGLING WITH PANDAS, NUMPY, AND IPYTHON. Beijing: O'Reilly Media. ISBN 978-1449319793.
RUSSELL, Matthew A. MINING THE SOCIAL WEB: DATA MINING FACEBOOK, TWITTER, LINKEDIN, GOOGLE+, GITHUB, AND MORE. 2nd ed. Sebastopol: O'Reilly, 2014. ISBN 978-1-449-36761-9.
JANERT, Philipp K. DATA ANALYSIS WITH OPEN SOURCE TOOLS. Sebastopol, CA: O'Reilly. ISBN 05-968-0235-8.
LUTZ, Mark. LEARNING PYTHON. 5th ed. Beijing: O'Reilly Media, 2013. ISBN 978-1449355739.
BIRD, Steven, Ewan KLEIN and Edward LOPER. NATURAL LANGUAGE PROCESSING WITH PYTHON. Beijing: O'Reilly, 2009. ISBN 978-0596516499.
PERKINS, Jacob. PYTHON TEXT PROCESSING WITH NLTK 2.0 COOKBOOK. Birmingham, UK: Packt Publishing, 2010. ISBN 978-1849513609.
20. outline
Books
O'NEIL, Cathy and Rachel SCHUTT. DOING DATA SCIENCE. Sebastopol, CA: O'Reilly, 2013. ISBN 14-493-5865-9.
RAJARAMAN, Anand and Jeffrey ULLMAN. MINING OF MASSIVE DATASETS. Cambridge: Cambridge University Press, 2012. ISBN 11-070-1535-9.
NORTH, Matthew. DATA MINING FOR THE MASSES. Global Text Project, 2012. ISBN 06-156-8437-8.
PROVOST, Foster. DATA SCIENCE FOR BUSINESS: WHAT YOU NEED TO KNOW ABOUT DATA MINING AND DATA-ANALYTIC THINKING. Sebastopol, CA: O'Reilly. ISBN 978-1-449-36132-7.
MINELLI, Michael, Michele CHAMBERS and Ambiga DHIRAJ. BIG DATA BIG ANALYTICS: EMERGING BUSINESS INTELLIGENCE AND ANALYTIC TRENDS FOR TODAY'S BUSINESSES. Wiley, 2013. ISBN 111814760X.
BOSLAUGH, Sarah. STATISTICS IN A NUTSHELL. 2nd ed. Farnham, Surrey, England: O'Reilly, 2012. ISBN 14-493-1682-4.
22. outline
self-directed learners, those who prefer distance/blended learning, those who want to know more, or those who don't want to rely on only one source of information might want to complement/substitute different parts of the course on
• Coursera
• MIT OpenCourseWare
• Stanford Online
• edX
• Khan Academy
• Codecademy
• and many other 'Google it & learn it' resources (or 'YouTube it & watch it' =))