An introduction to using SQL to import data into JMP

1 An introduction to using SQL to import data into JMPSQL...
Author: Gary Anthony
0 downloads 2 Views

1 An introduction to using SQL to import data into JMPSQL & JMP An introduction to using SQL to import data into JMP

2 about: Miles Lane Current Role: Lean Spartanburg Regional Healthcare System Edumacation: The Hills of Northeast Tennessee Wofford College B.S. in Chemistry and Computer Science/Math Likes: Football, any sport except running Growing Facial hair… well, pretty much any body hair… Family, Dog, Wife, Kids (in historical order) Video Games (volunterrier on Xbox Live) Dislikes: The futility of my short legs while trying to train for a 5K Poorly designed databases & slow performing queries

3 … and how do you pronounce it?What is SQL?

4 Structured Query LanguageFrom Pronouncing SQL: S-Q-L or Sequel? Initially developed in the early 1970s at IBM by Donald Chamberlin and Raymond Boyce, what we know as SQL today was initially called “Structured English Query Language” (SEQUEL) and pronounced “sequel”… as a bit of a pun since it was the sequel to QUEL. Microsoft has no official guide, but the majority of Microsoft SQL Server developers say “sequel”. Oracle’s docs say “sequel”. MySQL says it is officially pronounced “My Ess Que Ell”. Wikipedia says it is officially pronounced “S-Q-L”. ANSWER… there is no correct answer! Tomato / Tomahto Potato / Potahto S-Q-L / Sequel both work! Personally, I almost exclusively say “sequel”… How do you pronounce SQL?!?

5 Structured Query LanguageNon-procedural, Set-based language Describes what you want to get instead of how you want to get it: SELECT * FROM movies Do I, as a user, really care how the backend database gives me this data (table/index lookup/scans, hash matches, etc. etc.)? This is in sharp contrast to most procedural programming languages. Note that just because you can use SQL to ask for everything does not mean that you should! Client resources can be overloaded (i.e. Excel/JMP/etc. might/will crash while trying to load millions of rows of data). Server resources can be overwhelmed by multiple bad queries / inconsiderate users performing overly aggressive queries. The underlying design and/or optimization of the database may not provide adequate performance for the question you are asking. SELECT Year, COUNT(movieId) AS Movies FROM movies GROUP BY Year ORDER BY Year DESC Optimizing queries and becoming intimately familiar with how each backend database platform “makes its sausage” can be pretty daunting. ANSI SQL is an ISO standard that we can feel comfortable using to get most of our job done without too much worry… What is SQL?

6 MovieLens.org dataset SAMPLE DATABASE

7 MovieLens.org Used the “MovieLens Latest” Dataset ~22,000,000 million ratings ~580,000 tag applications ~33,000 movies ~240,000 users Last updated 01/2016

8 CSV Files Movies Links Tags/RatingsGenres are delimited text Year appended onto some (not all!) movie titles Links Pointers to MovieLens/IMDB/TheMovieDB websites Tags/Ratings Potentially many per movie UserID is anonomized Timestamp stored in seconds since 1/1/1970 Ratings.csv file way too large for Excel M rows!!! If file contains more than 1,048,576 rows or 16,384 columns, Excel will not like it.

9 MovieLens on SQLExpressCreated MovieLens Database in SQL Did some SQL cleanup: Imported CSV files into SQL via the Tasks>Import Data… option (DTS) Made sure to use nvarchar fields to handle UTF-8 character data for all the Unicode characters Normalized the Genre data Separated Year column from the Title Converted integer date to SQL tagdate/ratingdate

10 Using a sample MovieLens.org database to look at movie dataSQL Examples

11 SELECT * is OK for smaller tables… but it obviously get a problematic when the tables get larger… and will eventually crash the client (if you are lucky) and the server (and you will be in trouble!).

12 Note we CAST the title and AVG statements for formattingWhat are the 10 most frequently rated movies, and their average rating? Note we CAST the title and AVG statements for formatting We JOIN the movies with their ratings by movieId. Group by title, then Order by Number of ratings (column #2), descending Use SQL Server’s “TOP 10” phrase to limit our result set to 10 rows. This query shows the 10 most rated movies by total number of ratings. We cast the title to nvarchar(50) to limit its column width on the screen. We cast the average rating column to limit the decimal places and make the number a bit more readable.

13 What are the Movie Ratings by Genre?Using COUNT (DISTINCT mg.movieId) to return unique movies per genre Uses the resolution table moviegenre to relate genre list to specific movies and their ratings. Note there are more aggregate movies here (64,596) than total movies (34,208) since most movies have more than one Genre applied.

14 Pulling SQL data into JMP

15 Importing ODBC data into JMPFile > Database > Query Builder… -OR- Open Table…

16 Open Table… Useful for small/medium tables, or defined data sets.Create/Select an ODBC connection, schema and specific table. Clicking Advanced… gives the dialog to the right that will allow you to choose a Stored Procedure or make additional tweaks to the SQL statement.

17 Query Builder… Watch out when you try to preview a Join!

18 Note that aside from query

19

20 When you’re ready to get fancy…Convert to Custom SQL and you can go nuts!

21 QUESTIONS?