Course Outline

Each session is 2 hours

Day-1: Session-1: Business Overview: Why Big Data Business Intelligence in Government

  • Case Studies from NIH, DoE
  • Big Data adoption rate in Government Agencies and how they are aligning their future operations around Big Data Predictive Analytics
  • Broad Scale Application Area in DoD, NSA, IRS, USDA, etc.
  • Interfacing Big Data with Legacy data
  • Basic understanding of enabling technologies in predictive analytics
  • Data Integration & Dashboard visualization
  • Fraud management
  • Business Rule/Fraud detection generation
  • Threat detection and profiling
  • Cost-benefit analysis for Big Data implementation

Day-1: Session-2: Introduction to Big Data-1

  • Main characteristics of Big Data—volume, variety, velocity, and veracity. MPP architecture for volume.
  • Data Warehouses – static schema, slowly evolving dataset
  • MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
  • Hadoop-Based Solutions – no conditions on the structure of the dataset.
  • Typical pattern: HDFS, MapReduce (crunch), retrieve from HDFS
  • Batch—suited for analytical/non-interactive
  • Velocity: CEP streaming data
  • Typical choices – CEP products (e.g., IBM InfoSphere Streams, Apama, MarkLogic, etc.)
  • Less production ready – Storm/S4
  • NoSQL Databases – (columnar and key-value): Best suited as an analytical adjunct to data warehouses/databases

Day-1: Session-3: Introduction to Big Data-2

NoSQL solutions

  • KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store - Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
  • KV Store (Hierarchical) - GT.M, Caché
  • KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracoqua
  • Tuple Store - Gigaspaces, Coord, Apache River
  • Object Database - ZopeDB, db4o, Shoal
  • Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
  • Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning Issues in Big Data

  • RDBMS – static structure/schema, doesn’t promote an agile, exploratory environment.
  • NoSQL – semi-structured: enough structure to store data without defining an exact schema first
  • Data cleaning issues

Day-1: Session-4: Big Data Introduction-3: Hadoop

  • When to select Hadoop?
  • STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
  • SEMI STRUCTURED data – tough to do with traditional solutions (DW/DB)
  • Warehousing data = huge effort and static even after implementation
  • For variety & volume of data, crunched on commodity hardware – HADOOP
  • Commodity H/W needed to create a Hadoop Cluster

Introduction to MapReduce /HDFS

  • MapReduce – distribute computing over multiple servers
  • HDFS – make data available locally for the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Developer responsibility to make sense of data
  • Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
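
The map, shuffle, and reduce steps listed above can be sketched in a few lines of plain Python. This is only an in-process illustration of the programming model, not Hadoop itself: there is no HDFS, no cluster, and no Java API here, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data lake"]))
```

On a real cluster the mapper and reducer would run on many servers against blocks stored in HDFS; the sketch only shows what each stage is responsible for.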

Day-2: Session-1: Big Data Ecosystem - Building Big Data ETL: Universe of Big Data Tools - Which One to Use and When?

  • Hadoop vs. Other NoSQL solutions
  • For interactive, random access to data
  • HBase (column-oriented database) on top of Hadoop
  • Random access to data but restrictions imposed (max 1 PB)
  • Not good for ad-hoc analytics, good for logging, counting, time-series
  • Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g., log data) into HDFS

Day-2: Session-2: Big Data Management System

  • Moving parts, compute nodes start/fail: ZooKeeper - For configuration/coordination/naming services
  • Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
  • Deploy, configure, cluster management, upgrade, etc. (sys admin): Ambari
  • In Cloud: Whirr

Day-2: Session-3: Predictive Analytics in Business Intelligence -1: Fundamental Techniques & Machine Learning Based BI

  • Introduction to Machine learning
  • Learning classification techniques
  • Bayesian Prediction—preparing training file
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Network
  • Big Data large variable problem - Random forest (RF)
  • Big Data Automation problem – Multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytic tool—Treeminer
  • Agile learning
  • Agent-based learning
  • Distributed learning
  • Introduction to Open Source Tools for predictive analytics: R, RapidMiner, Mahout
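
As a small illustration of the classification techniques listed above, here is a minimal k-nearest-neighbours sketch in plain Python. It is ordinary KNN with Euclidean distance, not the p-Tree or vertical-mining variant named in the outline, and the training points and labels are invented for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training rows closest to query.

    train is a list of (features, label) pairs; features and query are
    equal-length tuples of numbers.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set (values are illustrative only).
TRAIN = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.2), "suspicious"),
    ((5.1, 4.9), "suspicious"),
    ((4.8, 5.0), "suspicious"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (1.1, 1.0)))
    print(knn_predict(TRAIN, (5.0, 5.0)))
```

Sorting the whole training set for every query is fine for a toy example but scales poorly, which is one reason large datasets call for indexed or distributed variants.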

Day-2: Session-4: Predictive Analytics Ecosystem -2: Common Predictive Analytic Problems in Government

  • Insight analytic
  • Visualization analytic
  • Structured predictive analytic
  • Unstructured predictive analytic
  • Threat/fraudster/vendor profiling
  • Recommendation Engine
  • Pattern detection
  • Rule/Scenario discovery—failure, fraud, optimization
  • Root cause discovery
  • Sentiment analysis
  • CRM analytic
  • Network analytic
  • Text Analytics
  • Technology-assisted review
  • Fraud analytic
  • Real-Time Analytic

Day-3: Session-1: Real-Time and Scalable Analytic Over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS
  • Apache Hama—Bulk Synchronous distributed computing
  • Apache Spark – Cluster computing for real-time analytic
  • CMU GraphLab 2 – Graph-based asynchronous approach to distributed computing
  • KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation

Day-3: Session-2: Tools for eDiscovery and Forensics

  • eDiscovery over Big Data vs. Legacy data—a comparison of cost and performance
  • Predictive coding and technology-assisted review (TAR)
  • Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
  • Faster indexing through HDFS—velocity of data
  • NLP or Natural Language processing—various techniques and open-source products
  • eDiscovery in foreign languages—technology for foreign language processing

Day-3: Session-3: Big Data BI for Cyber Security – Understanding Whole 360-Degree Views of Speedy Data Collection to Threat Identification

  • Understanding basics of security analytics—attack surface, security misconfiguration, host defenses
  • Network infrastructure/large datapipe/Response ETL for real-time analytic
  • Prescriptive vs. predictive—Fixed rule-based vs. auto-discovery of threat rules from metadata

Day-3: Session-4: Big Data in USDA: Application in Agriculture

  • Introduction to IoT (Internet of Things) for agriculture—sensor-based Big Data and control
  • Introduction to Satellite imaging and its application in agriculture
  • Integrating sensor and image data for soil fertility, cultivation recommendation, and forecasting
  • Agriculture insurance and Big Data
  • Crop loss forecasting

Day-4: Session-1: Fraud Prevention BI from Big Data in Government—Fraud Analytics:

  • Basic classification of fraud analytics—rule-based vs. predictive analytics
  • Supervised vs. unsupervised machine learning for fraud pattern detection
  • Vendor fraud/overcharging for projects
  • Medicare and Medicaid fraud—fraud detection techniques for claim processing
  • Travel reimbursement fraud
  • IRS refund fraud
  • Case studies and live demos will be given wherever data is available.
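
The contrast between rule-based and unsupervised detection named above can be sketched as follows. The claim records, the fixed limit, and the z-score threshold are all hypothetical, and the z-score test is just one simple stand-in for an unsupervised method; it is not presented as the technique used in the course.

```python
from statistics import mean, stdev


def rule_based_flags(claims, limit):
    """Fixed rule: flag every claim whose amount exceeds a preset limit."""
    return [c["id"] for c in claims if c["amount"] > limit]


def zscore_flags(claims, threshold=2.0):
    """Unsupervised: flag claims that sit far from the mean of the batch,
    measured in sample standard deviations."""
    amounts = [c["amount"] for c in claims]
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return []
    return [c["id"] for c in claims if abs(c["amount"] - mu) / sigma > threshold]


# Hypothetical reimbursement claims (amounts are illustrative only).
CLAIMS = [
    {"id": "c1", "amount": 100.0},
    {"id": "c2", "amount": 110.0},
    {"id": "c3", "amount": 95.0},
    {"id": "c4", "amount": 105.0},
    {"id": "c5", "amount": 90.0},
    {"id": "c6", "amount": 100.0},
    {"id": "c7", "amount": 115.0},
    {"id": "c8", "amount": 85.0},
    {"id": "c9", "amount": 900.0},
]

if __name__ == "__main__":
    print(rule_based_flags(CLAIMS, limit=1000.0))
    print(zscore_flags(CLAIMS))
```

With a limit of 1000 the fixed rule flags nothing, while the z-score check picks out the one claim that is unusual relative to the rest of the batch.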

Day-4: Session-2: Social Media Analytic—Intelligence Gathering and Analysis

  • Big Data ETL API for extracting social media data
  • Text, image, metadata, and video
  • Sentiment analysis from social media feed
  • Contextual and non-contextual filtering of social media feed
  • Social Media Dashboard to integrate diverse social media
  • Automated profiling of social media profiles
  • Live demo of each analytic will be given through Treeminer Tool.

Day-4: Session-3: Big Data Analytic in Image Processing and Video Feeds

  • Image storage techniques in Big Data—storage solutions for data exceeding petabytes
  • LTFS and LTO
  • GPFS-LTFS (Layered storage solution for big image data)
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation
  • Motion tracking
  • 3-D image reconstruction

Day-4: Session-4: Big Data Applications in NIH:

  • Emerging areas of Bioinformatics
  • Meta-genomics and Big Data mining issues
  • Big Data Predictive analytics for Pharmacogenomics, Metabolomics, and Proteomics
  • Big Data in downstream Genomics processes
  • Application of Big Data predictive analytics in Public health

Big Data Dashboard for Quick Accessibility of Diverse Data and Display:

  • Integration of existing application platforms with Big Data Dashboard
  • Big Data management
  • Case study of Big Data Dashboard: Tableau and Pentaho
  • Use Big Data apps to push location-based services in government
  • Tracking system and management

Day-5: Session-1: How to Justify Big Data BI Implementation Within an Organization:

  • Defining ROI for Big Data implementation
  • Case studies on productivity gains from saving analyst time in data collection and preparation
  • Case studies of revenue gain from saving licensed database costs
  • Revenue gain from location-based services
  • Savings from fraud prevention
  • An integrated spreadsheet approach to calculate approximate expense vs. revenue gain/savings from Big Data implementation.
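
The expense-versus-gain calculation described above can be sketched as a short function. The cost and gain categories mirror the bullets in this session, but every figure below is invented for illustration, and the ROI formula used, (gains - costs) / costs, is the common textbook definition rather than anything specified by the course.

```python
def roi_summary(costs, gains):
    """Sum costs and gains and return the net benefit and simple ROI.

    costs and gains are dicts mapping a category name to an annual amount.
    """
    total_cost = sum(costs.values())
    total_gain = sum(gains.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive to compute ROI")
    net_benefit = total_gain - total_cost
    return {
        "total_cost": total_cost,
        "total_gain": total_gain,
        "net_benefit": net_benefit,
        "roi": net_benefit / total_cost,
    }


# Hypothetical annual figures (illustrative only).
COSTS = {"hardware_and_cloud": 200_000, "staff_and_training": 300_000}
GAINS = {
    "analyst_time_saved": 250_000,
    "database_license_savings": 150_000,
    "fraud_prevention_savings": 200_000,
}

if __name__ == "__main__":
    summary = roi_summary(COSTS, GAINS)
    print(f"Net benefit: {summary['net_benefit']}, ROI: {summary['roi']:.0%}")
```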

Day-5: Session-2: Step-by-Step Procedure to Replace Legacy Data System with a Big Data System:

  • Understanding practical Big Data Migration Roadmap
  • Important information needed before architecting a Big Data implementation
  • Different ways of calculating volume, velocity, variety, and veracity of data
  • How to estimate data growth
  • Case studies
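
One simple way to estimate data growth is a compound-growth projection, sketched below. The constant monthly growth rate is an assumption made for this example; real growth is rarely that steady, and the starting volume, rate, and capacity figures are hypothetical.

```python
import math


def project_volume(current_tb, monthly_growth_rate, months):
    """Projected volume after a number of months of compound growth."""
    return current_tb * (1 + monthly_growth_rate) ** months


def months_until(current_tb, monthly_growth_rate, capacity_tb):
    """Whole months until the projected volume reaches capacity_tb."""
    if monthly_growth_rate <= 0 or current_tb <= 0:
        raise ValueError("volume and growth rate must be positive")
    if capacity_tb <= current_tb:
        return 0
    ratio = capacity_tb / current_tb
    return math.ceil(math.log(ratio) / math.log(1 + monthly_growth_rate))


if __name__ == "__main__":
    # Hypothetical: 100 TB today, growing 5% per month.
    print(round(project_volume(100.0, 0.05, 12), 1))
    print(months_until(100.0, 0.05, 200.0))
```

The same projection can be run per data source and summed, which gives a rough sizing input for the migration roadmap.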

Day-5: Session-4: Review of Big Data Vendors and Their Products. Q&A Session:

  • Accenture
  • APTEAN (Formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • Mu Sigma
  • NetApp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • QlikTech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Requirements

  • Fundamental knowledge of business operations and data systems for government within their domain
  • Basic understanding of SQL/Oracle or relational databases
  • Basic understanding of statistics (at the spreadsheet level)
 35 Hours
