Wine and Data Science

Hi all,

Over the last few months, I have been exploring ways to combine my interest in wine with another passion of mine: data. I have written a few articles (two example links at the bottom of this message) about ways to better understand wine through the use of data science techniques.

I’d love to hear what you all think about this approach to studying wine. Are there other aspects of the wine creation, tasting or selling processes you think might be worth exploring in quantitative ways?

Looking forward to hearing your thoughts!


Food and Wine Pairing

Building a Wine Recommender Model

As discussed in another current thread, you face the challenges of different consumers/users having different personal definitions and perspectives on specific wine characteristics, as well as different preferences and no consistency in alignment between what they say/think they like vs. what they actually like.

In fewer words: it won’t work.

However, if done well it may succeed as a marketing tool even if it fails as a tool for identifying wines people actually like.

Thanks David, appreciate your thoughts! I tried to mitigate this effect by only looking at reviews by experienced or professional wine reviewers: I have excluded any reviews by individuals who have only a handful of reviews in my dataset. The thinking there was that more experienced reviewers will have a more consistent and accurate way of describing the texture and flavor profiles of wines… of course, this doesn’t remove all of the subjectivity so your point is well-taken.
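
A minimal sketch of the reviewer filter described above, not the actual pipeline behind the articles. The field names, the toy data and the cutoff of 5 reviews are assumptions; the post only says "a handful".

```python
from collections import Counter


def filter_experienced(reviews, min_reviews=5):
    """Keep only reviews whose author has at least min_reviews reviews in the dataset."""
    counts = Counter(r["reviewer"] for r in reviews)
    return [r for r in reviews if counts[r["reviewer"]] >= min_reviews]


# Hypothetical toy data, for illustration only.
reviews = (
    [{"reviewer": "critic_a", "text": "crisp citrus"}] * 6
    + [{"reviewer": "critic_b", "text": "ripe plum"}] * 2
)

kept = filter_experienced(reviews, min_reviews=5)
print(len(kept), "reviews kept out of", len(reviews))
```

Raising or lowering min_reviews trades dataset size against how established each remaining reviewer is.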

Welcome, Roald.

You do face a quixotic task, but grouping data for general conclusions seems a fine thing!

There have been wine services that tried this in the past.

Vino Volo used to place a spot on a flavor/characteristic wheel, as well.

Your style of map may be superior. The tough part, of course, is the move from “subjective impressions” to “data.”

Please keep coming here. If I can think of anything useful to add, I will try.

I understand your approach Roald, and the value of using experienced tasters as your data source. The problem is that the majority of consumers are not experienced and may or may not align with them.

Not to be a total wet blanket, this sort of thing can get people interested and willing to explore, which is more than half the battle.

Thank you Anton! The picture you attached is very informative… wondering whether there are any lessons I might take from the two dimensions (fruit vs. complexity) along which the wines have been represented. In fact, complexity is not something I have modeled explicitly in any of my work so far. Do you think this needs to be addressed?

David - I have wondered about the problem of consumers not being able to articulate what they are experiencing when they drink a glass of wine. An expert might be able to pinpoint alcohol levels, acidity, tannin and an array of aroma compounds in a consistent way but this takes years of experience. I do think, however, that anyone can have a visceral reaction to a wine they try. They will like or dislike a wine but might not understand why. This is where the professional reviewers can help ‘explain’ the characteristics of a wine and give the consumer the vocabulary they need to better understand their preferences. You like this Pinot Noir from Santa Barbara? Well, these are the attributes of that wine based on what some professionals had to say… and here are a couple of things you might like too based on that.

I agree Roald. If you start with a list of wines someone already likes, you have a pretty good shot at being able to recommend others they’ll like. It’s much harder to do that with a list of descriptors alone.

I think the really good parallel to this would be a wine rec algorithm similar to how Pandora works with music. Both music and wine involve a lot of personal taste, and you can build a web of connecting characteristics that links one wine to another just like you can with music. Then people would just input wines they enjoy, and hopefully each one would help narrow down the potential recs.
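
One way to picture the "web of connecting characteristics" idea is a content-based recommender: describe each wine as a vector of characteristics, average the wines a person likes into a taste profile, and rank the rest by cosine similarity. This is only a sketch under assumptions; the wines, attribute names and scores below are made up for illustration, and it is not how Pandora or any existing wine app is known to work.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse attribute dicts."""
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(liked, catalog, top_n=2):
    """Rank wines not in `liked` by similarity to the average liked profile."""
    profile = {}
    for name in liked:
        for attr, value in catalog[name].items():
            profile[attr] = profile.get(attr, 0.0) + value / len(liked)
    scored = [
        (cosine(profile, attrs), name)
        for name, attrs in catalog.items()
        if name not in liked
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_n]]


# Hypothetical catalog: attribute intensities on a 0-1 scale.
catalog = {
    "Santa Barbara Pinot Noir": {"red_fruit": 0.9, "acidity": 0.7, "tannin": 0.3},
    "Oregon Pinot Noir": {"red_fruit": 0.8, "acidity": 0.8, "tannin": 0.3, "earth": 0.4},
    "Napa Cabernet": {"black_fruit": 0.9, "tannin": 0.9, "oak": 0.7},
    "Marlborough Sauvignon Blanc": {"citrus": 0.9, "acidity": 0.9, "herbaceous": 0.6},
}

print(recommend(["Santa Barbara Pinot Noir"], catalog))
```

Each additional liked wine shifts the averaged profile, which is the "narrowing down" effect described in the post.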

Roald, thanks for sharing the articles. In general I really like your approach. I’ve worked in data science though and perhaps I’m more open to these methods. Working with large scale data sets of subjective (i.e. human) experience descriptors seems to me like an extremely valid way to try to ‘define’ things (like wine descriptions) that are hard to define. Food matching is especially rough for people, so anything that suggests a quick, reasonable solution would be pretty handy.

I applaud your responses to the naysayers above. Since it’s the holiday time of year, I’m going to take a non-humbug approach. One thought is to include price in your data sets. Wine Searcher has lots of this info, and I think most people who used this as a tool would want to see suggestions in a particular price range. Might be fun to answer the question “how much of a better match would I get if I spent $10 more?” As a sales tool in a wine store, I could see these being really useful as well, especially if it was easy to import a stock list so a sales rep could use these to suggest wines in the store.

Thanks for the helpful responses Matt and Rich!

Matt - I love the idea. This could be a natural extension of the wine recommender model I have built. I’ve even thought about the concept of having people try a couple of different wines that exhibit certain characteristics and then extrapolating their preferences into a rudimentary profile of wines they might like. Over time, you could collect more feedback and provide more sophisticated recommendations. I think there are some existing apps that do this, but I think the typical approach is to say something like ‘if you like Oregon Pinot… you might like this other cool climate Pinot’. Would be cool to go beyond this using the methodology I have explored and to recommend wines based on a more sophisticated breakdown of their sensory profile. Do you know of any other apps/companies that currently do this?

Rich - appreciate the support! And including price in the analysis is definitely possible. I have this as part of my dataset already, so all I’d need to do is build some sort of a filter for the price of a wine. Your idea about this as a tool for a sales rep is really intriguing… it triggered a thought. One of the things I haven’t yet managed to quantify is some notion of quality-related attributes in a wine that are likely to be closely related with price. I’m thinking factors like balance, intensity and complexity. These factors might be more subjective than any of the aroma or nonaroma attributes of a wine, but might be pretty important. Grand Cru Puligny-Montrachet likely shares many of the same core flavors as an $8 Chardonnay from another part of the world (apple… citrus… etc), but has characteristics that make for a more pleasant drinking experience.

Do you think it would be worthwhile trying to parse out these more intangible quality characteristics? Then we could provide a much more sophisticated answer to the question you raised re: the tradeoff between price and quality.
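
The price filter and the "how much of a better match would I get if I spent $10 more?" question can be sketched as below. It assumes each recommendation already comes with a match score and a price; the tuple layout, the wine names and the numbers are hypothetical.

```python
def best_match_within(recommendations, max_price):
    """Return the best-scoring (name, score, price) tuple at or under max_price, or None."""
    affordable = [r for r in recommendations if r[2] <= max_price]
    return max(affordable, key=lambda r: r[1]) if affordable else None


def gain_from_spending_more(recommendations, budget, extra=10.0):
    """Improvement in match score if the budget is raised by `extra` dollars."""
    base = best_match_within(recommendations, budget)
    stretched = best_match_within(recommendations, budget + extra)
    if base is None or stretched is None:
        return None
    return stretched[1] - base[1]


# Hypothetical (name, match score, price in dollars) tuples.
recs = [("Wine A", 0.70, 18.0), ("Wine B", 0.85, 27.0), ("Wine C", 0.90, 45.0)]

print(best_match_within(recs, 20.0))
print(gain_from_spending_more(recs, budget=20.0, extra=10.0))
```

A store could run the same functions over an imported stock list so that only wines actually on the shelf are suggested.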

the thinking there was that more experienced reviewers will have a more consistent and accurate way of describing the texture and flavor profiles of wines…

Why would you assume this? Seriously. That’s not necessarily true. About 20 years ago there was some other guy who tried to do this. He took various reviewers and plotted them to come up with some way of telling whether a given taster was consistent or not. Some were, others were basically random. I don’t remember the guy or the exact year. We did exchange some e-mails regarding the study, which I found pretty amusing but of no real practical value.

Really, one of the best ways of getting your data already exists - Vivino and Cellar Tracker. You get data from a wide range of people and you can parse out some kind of average. That would be more useful than trying to understand a small handful of critics. Not everyone can articulate what they’re tasting, but they can state whether or not they like it, and you can correlate those scores to some body of descriptions.

I would love to run the models I have built using Vivino and Cellar Tracker data - but unfortunately for me their data is not available for public use (the winemag data I had found was available from a repository of existing datasets). I think you’re right in that a large number of reviews will likely make up for any shortcomings of individuals who are insufficiently experienced to articulate their sensory experience, and that averages & common descriptors across a crowd of people will likely yield results that are of a similar or high(er) quality than looking at individual wine reviews by a single reviewer.

I could take your point a step further - in addition to any individual biases, a single wine review is unlikely to contain a sufficient amount of information to portray the full array of attributes of a wine. At best, a wine review is 5 sentences long. There’s just no way it will contain an exhaustive list of all relevant descriptors that pertain to a wine. I have done something that I think controls for this: in the most recent article (link in my first post above) I have grouped together wines with the same varietal and from the same subregion (e.g. Sauv Blanc from Marlborough). This is likely to contain reviews from multiple different people covering a wide set of relevant descriptors, so will control for some of the inconsistency between reviewers.

This doesn’t get us to a balanced set of reviews at the individual wine level, but is an attempt to control for individual biases and incomplete reviews.
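
The grouping step described above can be sketched roughly as follows, assuming descriptors have already been extracted from each review. The field names and sample reviews are hypothetical, and each review counts a descriptor at most once so a single wordy reviewer cannot dominate a group.

```python
from collections import Counter, defaultdict


def pool_descriptors(reviews):
    """Map (varietal, subregion) to a Counter of descriptors across that group's reviews."""
    pooled = defaultdict(Counter)
    for review in reviews:
        key = (review["varietal"], review["subregion"])
        pooled[key].update(set(review["descriptors"]))
    return pooled


# Hypothetical reviews, for illustration only.
reviews = [
    {"varietal": "Sauvignon Blanc", "subregion": "Marlborough",
     "descriptors": ["grapefruit", "grass"]},
    {"varietal": "Sauvignon Blanc", "subregion": "Marlborough",
     "descriptors": ["grapefruit", "passion fruit"]},
    {"varietal": "Pinot Noir", "subregion": "Santa Barbara",
     "descriptors": ["cherry"]},
]

pooled = pool_descriptors(reviews)
print(pooled[("Sauvignon Blanc", "Marlborough")].most_common())
```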

I might be a little more optimistic than you about the value of the information contained within professional reviews, even if sample sizes are (fairly) low. There are some great papers that examine the consistency of descriptor usage between reviewers and how it correlates with perceptions about the quality of wines. The evidence in one such paper does suggest that there is a high degree of consistency in the words that reviewers use, and in how this relates to their assessment of the quality of a wine.

I’m interested in the last few comments. I’ve always thought the public comments on Cellartracker are a great avenue for someone to collate the data into an AI program. There are dating apps for instance that use your Facebook friends to suggest matches for dates. I would see if there is an opportunity to use similarity in ratings, likes, keywords (maybe based on similarity of tasting descriptions) to suggest friends and wines. Are public comments on Cellartracker copyrighted? I would think it falls in the same domain as Twitter or Facebook comments where people can use and reuse public posts for personal consumption. Gathering all of their tasting notes without tipping the site off to aberrant behavior may be a separate issue though.

For professional tasting notes, I think people rely more on pros that seem to rate wines similarly. For things like futures, you’re really at the mercy of ratings. My personal feeling is similar to yours in that I think any one reviewer can be off, but the totality of all reviews tends to cluster around the norm.

Wine and food pairing is so complicated. I don’t think you’re using a good source for information on that. This being the first rule makes the whole thing highly questionable (even though I agree with some of the points made).

(I) The body of the wine should roughly match the body of the food.

Also, if the wine should be sweeter than the food (if there’s sweetness in the food, I generally agree with this), how did you come up with a Chardonnay going with peach pie? The Chardonnay wouldn’t have more than a small amount of residual sugar.

Am I interpreting correctly that acidic wines shouldn’t go with bitter foods? That doesn’t make sense either.

I think this is all interesting on both fronts, but impossible. Thanks for posting, though, and I hope you stick around. If nothing else, these are interesting ideas to think about.

NoriY - I’ve done a bit of research and Cellartracker is (rightfully) quite cagey about the use of their data in third party applications. As far as I could see there also isn’t a public API that would allow users to query the CT database. That being said, this thread has inspired me to get in touch with the folks over at CT. Maybe they would be interested in seeing some of the methods I have used applied to their own dataset.

I also had another thought reading your comment. You mentioned that wine aficionados look for reviewers that rate wines similarly to them. Would it be worthwhile to build an algorithm that can show wine enthusiasts which professional reviewers align most closely with them, both in terms of raw ratings as well as the type of vocabulary used to describe a wine?
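
For the raw-ratings half of that idea, a minimal sketch could compute the Pearson correlation between an enthusiast's scores and each reviewer's scores on the wines both have rated. The reviewer names, scores and the min_shared cutoff of 3 are assumptions for illustration; vocabulary alignment is not covered here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def closest_reviewer(user_scores, reviewer_scores, min_shared=3):
    """Return (reviewer, correlation) for the best-aligned reviewer, or None."""
    best = None
    for reviewer, scores in reviewer_scores.items():
        shared = sorted(set(user_scores) & set(scores))
        if len(shared) < min_shared:
            continue  # too few wines in common to say anything
        r = pearson([user_scores[w] for w in shared], [scores[w] for w in shared])
        if best is None or r > best[1]:
            best = (reviewer, r)
    return best


# Hypothetical scores on a 100-point scale.
user = {"w1": 90, "w2": 85, "w3": 95, "w4": 88}
reviewers = {
    "reviewer_x": {"w1": 91, "w2": 86, "w3": 96, "w4": 89},
    "reviewer_y": {"w1": 88, "w2": 94, "w3": 86, "w4": 90},
    "reviewer_z": {"w1": 90, "w2": 85},
}

print(closest_reviewer(user, reviewers))
```

Correlation ignores a constant offset, so a reviewer who scores everything one point higher still counts as perfectly aligned.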

Doug - all fair criticisms. I tried to lay out a set of basic pairing rules based on a few different online sources. By no means perfect but keep in mind they are adjustable. The rules have some flexibility built in, so that is why (for better or for worse) the Chardonnay is being paired with the peach pie.

Overall I do think that there is science behind food and wine pairing. If that is true, there must be a set of rules that can be devised to come up with good pairings based solely on data. Although I might not have nailed the approach on this first pass, I do think this is a promising area for more work. It might be a matter of more data (going back to the CellarTracker and Vivino points mentioned in this thread) or of a more exact way of extracting information from the reviews. It could also be that more fine-tuning is required in applying Natural Language Processing techniques to wine vs. food reviews (words meaning different things in the context of food and wine). Either way… I’m not giving up on this idea quite yet :)

Hi all.

CellarTracker is copyrighted. We do not have an API. We have actually had a lot of problems with academics attempting to harvest large amounts of data for their own analysis and which they later loaded into the “public domain” against our terms of service. This was the worst example: SNAP: Web data: CellarTracker reviews

It actually took multiple cease and desists to get them to comply, and a few years later they even “restored” it one more time for good measure.

fascinating thread for sure. professionally, i think about these things all the time. Roald, i’d love to chat directly if you’re up for it.

Eric - thoughts on the linkedin scraping case and how it might apply to CT?

I was not familiar with that case, but it is consistent with the guidance I got back in 2011 when Snooth was scraping me. There are very few legal remedies in the US for scraping and basically no legal protections of “factual data.”

Europe on the other hand has severe penalties and significant protection. I could move my business to Europe, but what I have opted to do instead is have pretty severe anti-scraping countermeasures in-place that stop the bulk of the issues.

Roald, I think they are trying to do something different than you are, but here is another team of data scientists working to apply their craft to wine:

Their approach is more about aggregating and then normalizing critics scores (not words), and determining the level of consistency for a specific wine.
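
As a rough illustration of what normalizing critics' scores and measuring per-wine consistency could look like, here is a sketch that converts each critic's scores to z-scores and then takes the spread of those z-scores for a given wine. The z-score choice, the critic names and the numbers are assumptions; the team's actual method is not described in this thread.

```python
from statistics import mean, pstdev


def normalize_by_critic(scores):
    """Convert {critic: {wine: score}} to per-critic z-scores with the same shape."""
    normalized = {}
    for critic, by_wine in scores.items():
        values = list(by_wine.values())
        mu, sigma = mean(values), pstdev(values)
        normalized[critic] = {
            wine: (score - mu) / sigma if sigma else 0.0
            for wine, score in by_wine.items()
        }
    return normalized


def wine_spread(normalized, wine):
    """Spread of normalized scores for one wine (lower means critics agree more)."""
    values = [by_wine[wine] for by_wine in normalized.values() if wine in by_wine]
    return pstdev(values) if len(values) > 1 else None


# Hypothetical scores: critic_b uses a lower, wider scale than critic_a.
scores = {
    "critic_a": {"w1": 90, "w2": 94, "w3": 98},
    "critic_b": {"w1": 85, "w2": 90, "w3": 95},
}

normalized = normalize_by_critic(scores)
print(wine_spread(normalized, "w1"))
```

In this toy data the two critics disagree on raw points but rank and space the wines identically, so the spread after normalization is essentially zero.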

I’m afraid you’re putting far too much faith in “professional” critics. If you read notes on the same wine from different critics, you’ll usually find very little correlation in their descriptors. If you read a lot of tasting notes, you’ll find each critic has his or her own set of favorite terms that are used over and over, across different wine types.

Here are some examples I posted in a thread about critics and their notes. Note that these are all well-established, experienced wine writers:

2014 Beausejour Duffau Lagarosse

An utterly spellbinding wine, the 2014 Beauséjour Héritiers Duffau-Lagarrosse is also one of the unqualified successes of the vintage. Beams of tannin give the 2014 its ample, broad feel. Inky red cherry, blueberry, smoke, leather and tobacco fill out the wine’s big frame effortlessly. Layers of intense fruit meld into a huge spine of tannin in a vertical, massively structured Saint-Émilion. So many 2014s are charming and accessible, but this is not one of them. Readers will have to be patient. - Antonio Galloni, Vinous Media

Licorice, sweet oak, thyme, flowers, plum and assorted pit fruit [peaches? apricots?] make an entrance. On the palate, the wine has a polish to the tannins, sweetness to the fruit and a stony refinement in the finish. This is a vintage of Beausejour to drink young, while waiting for the 09, 10, 12, 15 and 16 to come around. Very fine for the vintage. - Wine Cellar Insider (Jeff Leve)

This is one of the best examples of this wine that I have tasted, reaching the same heights as some of the biggest names in this vintage, and barely a step down from 2016 - great stuff from these guys this year. It’s firm, bright, intense and deep, with salinity, grip and a lovely seam of freshness. It has a really excellent, juicy character and good persistency, with notes of liquorice and dark chocolate. - Decanter

While I wasn’t able to taste the 2015, the 2014 Château Beausejour Duffau-Lagarrosse is fabulous stuff and well worth seeking out. Made from close to 100% Merlot (there’s a splash of Cabernet Franc) and offering classic notes of damp earth, tobacco leaf, blackcurrants, and beautiful minerality, this beauty hits the palate with medium to full-bodied richness, a terrific core of fruit, and more texture and opulence than most in the vintage. It will keep for 20-25 years. - Jeb Dunnuck

This is full of muscular graphite and tobacco notes, holding sway over a core of slightly exotic mulled fig and warm black currant sauce. A ganache edge lines the finish, but a pure fruit detail echoes longest. This will be exceptional when the elements meld fully. Best from 2022 through 2035. 1,335 cases made. - James Molesworth, Wine Spectator

So layered with a lovely richness of chocolate, wet earth and spices, not to mention plum character. Full-bodied, tight and focused. Needs five to six years to open, but it’s a structured and beautiful wine already. - James Suckling

Tasted blind. Lively, well-balanced and well-behaved nose. Thick and confident. Lots of length but a bit Oxford marmalade-like. Overall satisfying though. Youthful. - Jancis Robinson

Trying to build a useful model based on garbage data like this is a lost cause, I think.
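
The lack of overlap can be put into numbers with a Jaccard index (shared descriptors divided by all descriptors used by either critic) over the notes quoted above. This is a rough sketch: the descriptor lists are hand-picked from those notes, and exact string matching is an assumption that treats "licorice" and "liquorice", or "tobacco" and "tobacco leaf", as different terms, so it understates any overlap a synonym-aware comparison would find.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared descriptors divided by all descriptors used by either critic."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Descriptors hand-picked from the tasting notes quoted above.
notes = {
    "Galloni": ["red cherry", "blueberry", "smoke", "leather", "tobacco"],
    "Leve": ["licorice", "sweet oak", "thyme", "flowers", "plum", "pit fruit"],
    "Decanter": ["salinity", "liquorice", "dark chocolate"],
    "Dunnuck": ["damp earth", "tobacco leaf", "blackcurrants", "minerality"],
    "Molesworth": ["graphite", "tobacco", "mulled fig", "black currant sauce"],
    "Suckling": ["chocolate", "wet earth", "spices", "plum"],
    "Robinson": ["Oxford marmalade"],
}

pairs = {(x, y): jaccard(notes[x], notes[y]) for x, y in combinations(notes, 2)}
mean_overlap = sum(pairs.values()) / len(pairs)

for (x, y), score in pairs.items():
    if score > 0:
        print(x, y, round(score, 3))
print("mean pairwise overlap:", round(mean_overlap, 3))
```

Mapping near-synonyms to a shared term before comparing would be the obvious next step if the exact-match numbers look too harsh.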