Confusing Terms in Data Science – A Look at Synonyms, Homonyms and more

To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms and synonyms, like these:

Homonyms (words with multiple meanings):

Bias: To a lay person, bias refers to an opinion formed in advance of the specific facts. As consideration of ethical issues in data science grows, this meaning has crept into discussions of the fairness or social worth of machine learning algorithms. But the term has a narrower definition in statistics: it refers to the tendency of an estimation procedure, or a model, to arrive at estimates or predictions that are, on balance, off target.
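As a small illustration of the statistical meaning, the sketch below compares two ways of estimating a variance from a sample: dividing by n and dividing by n - 1. The six-value population and the sample size of 2 are made-up assumptions, chosen so that every possible sample (drawn with replacement) can be enumerated instead of simulated.

```python
from itertools import product
from statistics import fmean

# Hypothetical population: the six faces of a die.
population = [1, 2, 3, 4, 5, 6]
n = 2  # sample size (an assumption for this sketch)


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / divisor


# True population variance (divide by the population size).
pop_var = variance(population, len(population))

# Every possible sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of each estimator over all possible samples.
mean_divide_by_n = fmean([variance(s, n) for s in samples])
mean_divide_by_n_minus_1 = fmean([variance(s, n - 1) for s in samples])

print(f"population variance:      {pop_var:.4f}")
print(f"average of /n estimates:  {mean_divide_by_n:.4f}")
print(f"average of /(n-1) values: {mean_divide_by_n_minus_1:.4f}")
```

On balance, the divide-by-n estimator lands below the population variance, which is what a statistician means by calling it biased. The divide-by-(n - 1) version averages out to the true value.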

Confidence: To a statistician, confidence measures the reliability of an estimate based on a sample (for example, we are 95% confident that the average blood sugar in the group lies between X and Y, based on a sample of N patients). To a machine learner, confidence can refer to a metric used in association rules (“what goes with what in market basket transactions”). It is one of several measures of the strength of a rule: the proportion of transactions containing the rule’s “if” items that also contain its “then” items.
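The association-rule sense of confidence can be sketched in a few lines. The transactions and the function names `support_count` and `confidence` below are made up for illustration. Confidence for a rule "if A then B" is computed here as the number of transactions containing both A and B, divided by the number containing A.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    n_antecedent = support_count(antecedent, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    n_both = support_count(antecedent | consequent, transactions)
    return n_both / n_antecedent


# Rule: "if bread, then milk".
rule_conf = confidence({"bread"}, {"milk"}, transactions)
print(rule_conf)  # 3 of the 4 bread transactions also contain milk -> 0.75
```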

Decision Trees: To statisticians and machine learners, “decision trees,” also called “classification and regression trees” (CART), refers to a class of algorithms that progressively partition data into chunks that are more and more homogeneous with respect to the outcome variable. The result is a branching set of rules applied to predictor variables to predict the outcome. To an operations research specialist, a decision tree is a representation of sequential decisions and their possible outcomes, with probabilities and costs or benefits attached to the outcomes. The path ending in the highest expected value then guides decisions.
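The operations research meaning can be sketched as a one-level tree: each option has a cost and a set of outcomes with probabilities and payoffs, and the option with the highest expected value is chosen. The options, costs, probabilities, and payoffs below are invented purely for illustration.

```python
# Hypothetical one-level decision tree.
# Each option has an upfront cost and a list of (probability, payoff) outcomes.
options = {
    "launch product": {"cost": 50, "outcomes": [(0.6, 200), (0.4, -20)]},
    "do nothing": {"cost": 0, "outcomes": [(1.0, 0)]},
}


def expected_value(option):
    """Probability-weighted payoff of an option, net of its cost."""
    gross = sum(p * payoff for p, payoff in option["outcomes"])
    return gross - option["cost"]


for name, option in options.items():
    print(f"{name}: expected value = {expected_value(option):.1f}")

# The branch with the highest expected value guides the decision.
best = max(options, key=lambda name: expected_value(options[name]))
print("choose:", best)
```

A real decision tree of this kind usually has several levels, with expected values rolled back from the leaves toward the root. The single level here keeps the arithmetic visible.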

Graph: To a lay person, a graph usually means a visual representation of data, which statisticians more often refer to as a plot or chart. To a computer scientist, a graph is a data structure made up of entities and the ties or links between them. Speaking of graphs, Wikipedia has an interesting Euler diagram of homonyms, synonyms, homographs and their cousins.
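For readers who have only met the plotting sense of the word, here is a minimal sketch of a graph as a data structure, stored as an adjacency list. The names and links are made up. The graph is treated as undirected, which is an assumption; many real graphs have directed links.

```python
from collections import defaultdict

# Hypothetical links (edges) between entities (nodes).
edges = [("Ann", "Bob"), ("Bob", "Cara"), ("Ann", "Cara"), ("Cara", "Dev")]

# Adjacency list: each node maps to the set of nodes it is linked to.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)  # undirected: record the link in both directions

# Degree: how many links each node has.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(degree)
```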

Normalize: In statistics and machine learning, to normalize a variable is to rescale it so that it is on the same scale as other variables to be used in a model. For example, one might subtract the mean, so that the variable is centered around 0, and divide by the standard deviation, so that its scale is consistent with other variables normalized the same way. In database management, normalization refers to the process of organizing relational databases and their tables so that the data are not redundant and relations among tables are consistent.
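The statistical sense of normalizing, as described above, can be sketched as follows. The function name `standardize` and the height values are illustrative assumptions. The sketch divides by the population standard deviation (`pstdev`); using the sample standard deviation (`stdev`) instead is an equally common choice.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m = fmean(values)
    s = pstdev(values)  # population standard deviation (an assumption)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


# Hypothetical variable measured in centimeters.
heights_cm = [150, 160, 170, 180, 190]
z = standardize(heights_cm)

print([round(v, 3) for v in z])
print("mean:", round(fmean(z), 6), "std dev:", round(pstdev(z), 6))
```

After rescaling, the variable has mean 0 and standard deviation 1, so it can be compared with other variables treated the same way regardless of their original units.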

Sample: In statistics, a sample is a collection of observations or records. In computer science and machine learning, sample often refers to a single record.

Synonyms (different words for the same thing):

Record: The prevalent non-time-series data format is the spreadsheet model, where each column is a variable, and each row is a record. So a row might represent a patient, for example, and the cell values are measurements on variables. Statisticians will also call the record a case, or an observation. In computer science, the terms instance, sample, or example might be used.

Prediction: In statistical and machine learning, prediction is the use of a model to predict individual outcomes on the basis of known predictor variables. The term “estimation” is also used, though its use is generally limited to numeric outcomes (as opposed to categorical or binary). In statistics, estimation more often refers to the use of a sample statistic (say, the mean) to measure something, where the measurement is interpreted as representing a larger population.

Predictor variable: In computer science and machine learning this can be called an attribute, input variable or feature. In classical statistics, the term “independent variable” is used, and in database management the term “field” is applied. In artificial intelligence applications, models must typically start with very low level predictor information, such as pixel values or sound wavelengths. Here the term “feature” means more than simply a given predictor variable: it also covers the aggregations of low-level predictors that are developed into more informative “features” (also called “higher level features”).

Data partitions: In predictive modeling, models are trained on data where the outcome is known. To assess the performance of those models, a portion of the data is set aside and the model is used to predict values that can be compared to the known values in this set-aside data. Sometimes, particularly where there is a lot of iteration between the set-aside and the training data to “tune” model parameters and select the best model, a third set-aside is used just to predict how well the model will do with new data. These set-asides have different names, not necessarily denoting which function they are serving: holdout data, test data, validation data.
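A three-way partition of the kind described above might be sketched as follows. The function name `partition`, the 60/20/20 proportions, and the labels "training", "validation", and "test" are assumptions for illustration. As noted, the names of the set-asides vary from one source to another.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and split them into three non-overlapping sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)

    train = shuffled[:n_train]                   # used to fit models
    valid = shuffled[n_train:n_train + n_valid]  # used to tune and select
    test = shuffled[n_train + n_valid:]          # used once, at the end
    return train, valid, test


# Hypothetical data: 100 record identifiers.
records = list(range(100))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Because the third set plays no part in tuning or model selection, performance on it gives a less optimistic estimate of how the chosen model will do with new data.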

About Statistics.com

Statistics.com offers academic and professional education in statistics, analytics, and data science at beginner, intermediate, and advanced levels of instruction. Statistics.com is a part of Elder Research, a data science consultancy with 25 years of experience in data analytics.

 The Institute for Statistics Education is certified to operate by the State Council of Higher Education for Virginia (SCHEV)

Contact

The Institute for Statistics Education
2107 Wilson Blvd
Suite 850 
Arlington, VA 22201
(571) 281-8817

ourcourses@statistics.com

© Copyright 2023 - Statistics.com, LLC | All Rights Reserved | Privacy Policy | Terms of Use
