Creating a Banner Archive Holly Markel, Database Programmer/Analyst
Creating a Banner Archive
Holly Markel, Database Programmer/Analyst
Computing and Information Technology
June 22, 2018
[email protected]
Who is this person and what is she doing here?!?
Holly Markel
- Wrote my first computer program at the age of 8!
Who is this person and what is she doing here?!? (continued)
Holly Markel
- Graduated from SVSU (a Colleague school)
- 3.5 years working in their Administrative Information Systems department
- 19 years working in Information Technology
- 13 years working in databases/SQL
- 10 years working in student information systems
- 7 years working in data warehousing and ETL (Extract, Transform, and Load data)
- 3 years working for Hope College
A little about Hope College
- Founded in 1866 in partnership with the Reformed Church in America
- Four-year liberal arts college
- Located in Holland, MI
- Over 33,000 alumni
A little about Hope College's Banner
- We have been on Banner for 20 years
- We use Banner for: Finance, A/R, HR, Payroll, Student, Financial Aid
- Transitioning from Banner Advancement to Blackbaud's The Raiser's Edge
- Using Hobsons Radius for Admissions
- Currently around 3,300 active students and 1,000 faculty/staff
- Banner 8 on Oracle 11 on Windows; working on transitioning to Banner 9 on Oracle 12 on Linux
A little about Hope College's data use
- We have around 40 reporting users accessing Banner data directly
  - MS Access
  - Tableau
- A few IT users run direct Oracle queries in Toad or SQL Developer
- Our own home-grown reports through a web application
  - Many additional users go to this site
- Around 350 database views in the Banner database
- Around 50 materialized views that refresh nightly (see the sketch below)
  - The biggest materialized view takes around 8 hours to refresh right now! 133 columns across 45,000 records
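As a rough illustration only (this is not Hope College's actual object; the view name and query are placeholders), a nightly-refreshing materialized view in Oracle can be declared like this:

    -- Hypothetical example: names and the SELECT are placeholders.
    -- A full (COMPLETE) refresh scheduled nightly at 2 AM.
    CREATE MATERIALIZED VIEW rpt_student_summary
      BUILD IMMEDIATE
      REFRESH COMPLETE
      START WITH TRUNC(SYSDATE) + 1 + 2/24   -- first refresh tomorrow at 2 AM
      NEXT       TRUNC(SYSDATE) + 1 + 2/24   -- then nightly at 2 AM
    AS
    SELECT s.pidm, s.term_code, s.major_code
      FROM some_banner_view s;               -- placeholder source view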
SSIS – What is it?
- SSIS: SQL Server Integration Services
- An ETL service built for data integrations
- Uses the Visual Studio developer interface and debugging tools
- The Visual Studio 2012 version of the SSIS developer tools is free
- SSIS processes can be deployed and run automatically from a Microsoft SQL Server
- Hope College has created 48 separate SSIS processes for the data integrations we have running
Why have an archive of Banner data?
- Data as it was on the "day of"
- Many tables don't store historical changes, OR they don't store them very well
- Deleted records disappear from the source
- Direct queries against Banner cause:
  - Performance issues for others
  - Performance issues for the queries themselves
How this design process went
- Many changes throughout the process as we discovered things
- Mantra: BUILD IT IN!
Where do we start?
- What area do we really want to target first? STUDENT!!!
- How do we go about doing that? Tables
  - We started with about 25 tables (then later expanded to about 100 tables)
  - Get a list! Build it in!
What metadata do we need? Wait!!! What’s metadata?
What metadata do we need? (continued)
Process-level data:
- How to run (making the process dynamic)
- What ran
- When it ran
- How it went (were there errors? what did we get?)
What metadata do we need? (continued)
Row-level data:
- When we got it
- How long it was in the source
- When it changed
- Is this data current?
- Has this data been deleted?
- Are there errors in this row?
The Process – it seems so simple!
1. Get the list of tables
2. Log that the process started
3. Loop of actions for each table in the list:
   - Copy over the data we don't already have
   - Mark the data that has changed (if any) as no longer current
   - Mark the data no longer there (if any) as deleted
   - Flag any errors
4. Log that the process ended
Copy over the data – for each new table (first-time load)
Considerations first:
- Destination tables – who wants to do manual creation on all those tables? Build it into the process!
- Don't we need an index? Build that in, too!
  - Use the source primary key, plus a unique identifier for the destination row
- Build the unique identifier for the destination row into the row-level metadata – build it in!
The good news for new tables: we don't (yet) need to worry about marking records as old or deleted. (A sketch of a dynamically created destination table follows below.)
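A minimal sketch of what "build the destination table in" can look like. The schema, table, and metadata column names (arch_row_id, arch_is_current, and so on) are assumptions for illustration, not the actual Hope College objects:

    -- Illustrative only: source/destination names and metadata columns are assumed.
    DECLARE
      v_src  VARCHAR2(61) := 'SATURN.SPRIDEN';   -- hypothetical source table
      v_dest VARCHAR2(61) := 'ARCHIVE.SPRIDEN';  -- hypothetical destination table
    BEGIN
      -- First-time load: copy structure and current data in one statement.
      EXECUTE IMMEDIATE
        'CREATE TABLE ' || v_dest || ' AS SELECT * FROM ' || v_src;

      -- Add the row-level metadata columns the process needs.
      EXECUTE IMMEDIATE 'ALTER TABLE ' || v_dest || ' ADD (
          arch_row_id     NUMBER,                      -- unique id for the destination row
          arch_load_seq   NUMBER,                      -- load the row came in on
          arch_start_date DATE,
          arch_end_date   DATE,
          arch_is_current CHAR(1) DEFAULT ''Y'',
          arch_is_deleted CHAR(1) DEFAULT ''N'',
          arch_row_hash   VARCHAR2(64),                -- comparison hash (see later slide)
          arch_error_code VARCHAR2(10))';
    END;
    /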
Copy over the data – for each existing table
- Find the new records in the source since the last load
- Insert them into the destination
- Compare the previous data to what you just added
  - If a record has changed, mark the previous one (by insertion date) as old
- Compare the 'current' (by flag) data in the destination to the source
  - Records in the destination but not the source? They've been deleted (see the sketch below)
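A simplified sketch of the "mark old" and "mark deleted" steps, continuing the assumed names from the previous sketch (the real SPRIDEN key has a third field and the real process builds these statements dynamically per table):

    -- Illustrative only; two-column key kept deliberately simple.
    -- 1) After inserting this load's rows, close out any older row for the same key.
    UPDATE archive.spriden d
       SET d.arch_is_current = 'N',
           d.arch_end_date   = SYSDATE
     WHERE d.arch_is_current = 'Y'
       AND EXISTS (SELECT 1
                     FROM archive.spriden n
                    WHERE n.spriden_pidm  = d.spriden_pidm
                      AND n.spriden_id    = d.spriden_id
                      AND n.arch_load_seq > d.arch_load_seq);

    -- 2) Rows still flagged current but no longer present in source have been deleted.
    UPDATE archive.spriden d
       SET d.arch_is_deleted = 'Y',
           d.arch_end_date   = SYSDATE
     WHERE d.arch_is_current = 'Y'
       AND d.arch_is_deleted = 'N'
       AND NOT EXISTS (SELECT 1
                         FROM saturn.spriden s   -- in practice, read over a database link
                        WHERE s.spriden_pidm = d.spriden_pidm
                          AND s.spriden_id   = d.spriden_id);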
What's new? How do I know what data is new?
- Use the source system's change data capture (CDC) dates, if possible (or create them in the source)
  - GREAT! I CAN USE ACTIVITY DATES!!! They're already there!
  - WHAT?!? I CAN'T USE ACTIVITY DATES?
- Compare every piece of data in the source record to the destination (Uggh! Right?)
- Hash! SQL function: DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW( *** ), 3)
  - *** = the full concatenated list of columns goes here – omit the ACTIVITY_DATE field
- Better throw that hash into the row-level metadata, too – build it in! (example below)
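A hedged example of the hash in use. The column list here is a short placeholder for the full concatenated list; the table and columns are illustrative. Note that the literal 3 is the value of DBMS_CRYPTO.HASH_SH1, i.e. a SHA-1 hash, and DBMS_CRYPTO requires an EXECUTE grant:

    -- Illustrative only: the concatenation is a stand-in for the full column list
    -- (minus the ACTIVITY_DATE column).
    SELECT spriden_pidm,
           RAWTOHEX(DBMS_CRYPTO.HASH(
             UTL_RAW.CAST_TO_RAW(spriden_id || spriden_last_name || spriden_first_name),
             3)) AS row_hash          -- 3 = DBMS_CRYPTO.HASH_SH1 (SHA-1)
      FROM saturn.spriden;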
How do I know if I have the same record?
- Source table PK
  - Some Banner tables have up to 9 fields in the primary key! Gotta compare them all – build it in!
- Some Banner tables don't have a primary key
  - Seriously? Yes, seriously
  - Can we infer a potential primary key? We'll need it for that index, anyway
  - Sometimes, so let's capture it, but better safe than sorry:
    - If an entire row is the same (by hash), we have it already
    - Otherwise, mark it as deleted if it isn't in the source exactly as-is
How do I compare source to destination (and still keep this a dynamic process)?
- Dynamically created SQL statements
- Join source to destination
- This adds to the metadata we need to keep for the list of tables (see the sketch below):
  - Source schema name
  - Source table name
  - Destination schema name
  - Destination table name
  - Number of key fields
  - Key field names (up to 9)
  - Last modified field (activity date) name
  - Has true source PK?
  - Is new table?
  - Last load sequence
  - Last load time
- At run-time: the list of columns
  - Pull it from the source DBMS's data dictionary (Oracle: ALL_TAB_COLUMNS) so that we don't have to store the whole column list for each table
  - SSIS has a maximum field length of 4000 characters – more on this later
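A sketch of the table-list metadata and the run-time column lookup. All names here are assumptions made for illustration, not the actual metadata table shipped with the presentation:

    -- Illustrative metadata table for the list of tables to archive.
    CREATE TABLE archive_table_list (
      source_schema       VARCHAR2(30),
      source_table        VARCHAR2(30),
      dest_schema         VARCHAR2(30),
      dest_table          VARCHAR2(30),
      key_field_count     NUMBER,
      key_field_1         VARCHAR2(30),   -- ... through key_field_9
      last_modified_field VARCHAR2(30),   -- activity date column name
      has_true_source_pk  CHAR(1),
      is_new_table        CHAR(1),
      last_load_seq       NUMBER,
      last_load_time      DATE
    );

    -- At run time, build the column list from the Oracle data dictionary
    -- instead of storing it per table (LISTAGG is available in 11gR2+).
    SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_id) AS column_list
      FROM all_tab_columns
     WHERE owner       = 'SATURN'
       AND table_name  = 'SPRIDEN'
       AND column_name <> 'SPRIDEN_ACTIVITY_DATE';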
Moving this to another database
- Wasn't part of the goal to move the archive somewhere else, to keep Banner from having performance problems?
- How do we keep our dynamic queries with their source-to-destination joins?
- Oracle database links – build it in!
Oracle database links (example below)
- Uh-oh! Oracle database links have a 4000-byte maximum field length (no LOBs allowed)
- We need to make some sort of special exception for tables that could have these
- Can we leave the LOB source fields out? When are we ever going to do analytical reporting off a field larger than 4000 characters?
- What about when the total length of the hashed columns is over 4000? OR when the column list is more than 4000 characters (since we're bringing that over, too)?
  - Are there columns that are larger than we would report off of, but under 4000? Eliminate!
  - Are there columns we have never used that have been around for a long time? Eliminate!
  - Omit the Banner 9 system columns (Surrogate ID, Version, Data Origin, VPDI Code) from this elimination, or Common Database Upgrades could make your data warehouse server's brain explode
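For orientation, a minimal database-link example on the warehouse side. The link name, account, password, and TNS alias are placeholders, and the select list deliberately leaves LOB columns out, per the limits above:

    -- Illustrative only: credentials and connect string are placeholders.
    CREATE DATABASE LINK banner_prod
      CONNECT TO archive_reader IDENTIFIED BY "not_the_real_password"
      USING 'BANNERPROD';

    -- Query the Banner source over the link; keep LOB columns out of the list.
    SELECT spriden_pidm, spriden_id, spriden_last_name
      FROM saturn.spriden@banner_prod;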
Flagging errors/warnings
What do we want to catch?
- Duplicate records – more on this later!
- PIDM 0 (integration gone wild)
- Table load errors
Do we include errored data in our analytics?
- Put a flag on the error type code value for whether or not the error is a data quality problem
Duplicate records are special snowflakes
Duplicated data versus true duplicates:
  Column A | Column B | Column C | Concatenated value
  1        | 2        | 3        | 123
  12       | (null)   | 3        | 123
- Build it in! Put a character between each column when building the hash (see the example below)
  - This will make the list of column names you can include shorter
- If we detect a true duplicate, are both rows an error? No – only mark one of them
- When checking for deletions, we also need to check whether records are still duplicates
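A small illustration of why the delimiter matters. Without one, the two rows in the table above both concatenate to '123' and hash identically; with a '|' between columns they hash differently (values here are literals just for demonstration):

    -- With a delimiter, ('1','2','3') and ('12',NULL,'3') produce different hashes;
    -- without it, both would concatenate to '123' and falsely match.
    SELECT RAWTOHEX(DBMS_CRYPTO.HASH(
             UTL_RAW.CAST_TO_RAW('1'  || '|' || '2'  || '|' || '3'), 3)) AS hash_a,
           RAWTOHEX(DBMS_CRYPTO.HASH(
             UTL_RAW.CAST_TO_RAW('12' || '|' || NULL || '|' || '3'), 3)) AS hash_b
      FROM dual;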
What if I just want current data?
- Create views using the Is Current flag (sketch below)
- Might as well do that dynamically at table creation time – build it in!
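A sketch of one such view, using the metadata column names assumed in the earlier sketches (the real process would generate one of these per archived table):

    -- Illustrative only; names are assumptions carried over from earlier sketches.
    CREATE OR REPLACE VIEW archive.spriden_current AS
    SELECT *
      FROM archive.spriden
     WHERE arch_is_current = 'Y'
       AND arch_is_deleted = 'N';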
What happens when Banner adds a column?
- The process will error
  - We want it to, so we can fix the destination table
- Add the new columns to the destination table and re-run!
- Make sure to also add them to the current-records view
Logging
- We're already recording the start and end time of the process
- Row-level metadata captures the load a record came in on
- Might as well capture (see the sketch below):
  - Start and end time of each table – build it in!
  - How many records in each load – build it in!
    - Inserts
    - Updates
    - Deletions
    - Errors – both new and total currently present
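One possible shape for the process-level and table-level log tables; these names and columns are assumptions for illustration, not the scripts offered with the presentation:

    -- Illustrative process-level log: one row per run.
    CREATE TABLE archive_load_log (
      load_seq    NUMBER,
      start_time  DATE,
      end_time    DATE,
      status      VARCHAR2(20)
    );

    -- Illustrative table-level log: one row per table per run.
    CREATE TABLE archive_table_load_log (
      load_seq          NUMBER,
      table_name        VARCHAR2(30),
      start_time        DATE,
      end_time          DATE,
      insert_count      NUMBER,
      update_count      NUMBER,
      delete_count      NUMBER,
      new_error_count   NUMBER,
      total_error_count NUMBER
    );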
What if the process fails while running?
- Lots of thought went into this
- SSIS's default error handling can send an email – do that – build it in!
- Where do we want it to stop versus move on and keep going?
  - Before it gets to any tables (or after it's done)? Fail and halt processing entirely
  - In the table loop? Skip on to the next table and error out that table
  - In a table load statement? Send the full statement in the email body text
Getting fancy (a.k.a. Oh crap, it's RCRAPP)
Exception columns: we had to build in some alternate logic to handle:
- Data in the RCRAPP tables
- The GORSDAV VALUE field (ANYDATA data type)
- A poorly used keyword-tagging process in a comment field
The options:
- Eliminate columns we don't have room for and aren't planning on using
- Truncate or substring where a field is too large but we still want it
  - We'll need to add a yes/no field to capture whether this truncated the source value – build it in!
  - Add another error code for when this field is a 'Yes'
- Type conversion, if the field can be converted
How do we use this stuff?
- Say I want to do some analysis on the set of currently enrolled students
  - Use the Is Current views! Just like live Banner data
- Say I want to look at the data we pulled on our IPEDS date last year
  - Use the start and end dates on rows as book-ends: the IPEDS date is between the record's Start Date and End Date
- We've built some dimensional model tables off of the data in our archive:
  - Dimensions: Person, Term, Major
  - Facts: Student Major – Flattened, Term-level Academics, Student Major – Single Relationship
Future plans
- More dimensional tables:
  - High Schools
  - Degrees
  - Student Attributes
  - Test Scores
  - Address information (?)
  - Admissions Details
- Future potential data:
  - Student Activities
  - Course-level information
  - Employee/Faculty information
What’s everyone else doing? Group discussion time!
Thank you!
- I hope you found this informative
- If you have any questions, now is the time to ask!
- For more information, please feel free to email me: [email protected]
  - For a copy of this presentation
  - For a copy of our SSIS package and metadata table creation scripts
Primary Keys (that aren't really Primary Keys)
- What to include in our destination table index:
  - Key values
  - Hash
  - Unique identifier field
  - Metadata fields that will be queried (Is Current flag, last load, error flag)
- Do we make our destination table indices a primary key?
  - Oracle doesn't allow nullable fields in a PK
- Inferred keys – tables without a source PK
  - SPRIDEN – PIDM, ID, Change Indicator
    - None of these are unique!
    - Change Indicator is nullable
    - Change Indicator can't be in the database PK
- Build it in – put nullable fields at the end of the list and drop off the last field if PK creation fails (see the sketch below)
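A simplified sketch of the "drop off the last field" fallback, using the SPRIDEN example above and the assumed metadata columns from the earlier sketches. Adding a PK over a column that contains NULLs will fail, which is exactly the case we catch here:

    -- Illustrative only: constraint/index names and the arch_* column are assumptions.
    BEGIN
      -- Try the full inferred key, nullable field (SPRIDEN_CHANGE_IND) last.
      EXECUTE IMMEDIATE
        'ALTER TABLE archive.spriden ADD CONSTRAINT spriden_arch_pk
           PRIMARY KEY (spriden_pidm, spriden_id, arch_row_id, spriden_change_ind)';
    EXCEPTION
      WHEN OTHERS THEN
        -- The nullable trailing field blocked the PK; retry without it.
        EXECUTE IMMEDIATE
          'ALTER TABLE archive.spriden ADD CONSTRAINT spriden_arch_pk
             PRIMARY KEY (spriden_pidm, spriden_id, arch_row_id)';
    END;
    /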