Selling Signals - the Data Monetisation Podcast
Selling Signals is the podcast for anyone building, selling, or buying data, with a focus on commercialising data in the investor ecosystem.
Each episode brings together industry insiders to share real, first-hand experience from the front lines of data sales. We unpack what actually works when turning raw data into revenue, whilst exploring other data buying silos to break down the walls between them.
Selling Signals delivers practical lessons to help data teams sell better and build stronger, more commercial data businesses.
Valsys: Ticker Mapping Explained
In this episode of Selling Signals, we are joined by James's Valsys co-founders, Jack Fuller and Simon Bessey, for a practical discussion on ticker mapping. It is one of the least visible parts of the alternative data workflow, but it has a big impact on whether a dataset can actually be used by the buy side.
We discuss what ticker mapping is and why funds care about it. Jack and Simon explain why the problem is harder than it looks. They show how point-in-time accuracy changes the picture and what good validation looks like. We also get into why ownership changes and M&A make mapping much harder to get right.
This episode is essential listening for any data provider that wants to make its product usable in capital markets.
Welcome to Selling Signals, the podcast focused on how businesses actually monetize and sell data. Each episode, we interview an industry insider to hear their experiences and lessons learned.
SPEAKER_02The series is powered by Valsys, the company that transforms your data into investment-ready intelligence products.
SPEAKER_01If you enjoy the episode, please subscribe from wherever you get your podcasts. Well, this is a rather strange episode for me, because today we're actually joined by my two co-founders, Jack and Simon, and Eric is going to be interviewing the three of us about ticker mapping. So, Jack, Simon, do you guys want to introduce yourselves?
SPEAKER_03Yeah, I'm Jack, I'm COO of Valsys, in as much as that means anything in a company of our size. Throughout the company's history, I've been variously responsible for many things, including compliance, HR, operations, and accountancy, as well as working fairly extensively during our first couple of products on the UI side. These days I find myself also building bits of UI for internal tooling, and working extensively on the ticker mapping aspect that we are going to discuss today.
SPEAKER_04Awesome. I'm Simon, CTO. My job's been relatively straightforward in the sense that it's back-end infrastructure, and now also some of the more AI side of things.
SPEAKER_02Yeah, we've been at this for about eight years, and still going. Going strong. What makes this episode a bit funnier is that we actually recorded this as the first test episode. We thought it was good, but now we've come on so far that it made sense to redo it. So it'll be interesting having a similar conversation to the one we did, what, eight months ago?
SPEAKER_01Yeah. Was it eight months ago?
SPEAKER_02Maybe that was a stretch. More like six? I don't know. I don't know.
SPEAKER_01Well, the time you've spent with me just sort of elongates. You've enjoyed it so much.
SPEAKER_02Awesome. Well, why don't we jump straight in? So today is about mapping data to tickers. Why don't we start with: what is a ticker?
SPEAKER_01Sure, yeah, I can take that. So a ticker is essentially a stock or a security identifier. I believe the term comes from when stock prices were printed on ticker tape, you know, a hundred or so years ago. Space then was at a premium: you only have so much paper, and to identify a company they didn't want to print out the full company name. So instead they came up with these identifiers, which were two, three, four letters; I think I've seen some much longer now. And that was essentially the way they would identify the stock so they could print the price.
SPEAKER_03And I think the important thing to note about a ticker these days is that its human readability is one of its main selling points. It's really good for screens and really good for individual traders to quickly recognize what they're looking at. If you look on Bloomberg News, you'll see a list of tickers scrolling along the bottom. Everyone knows what AAPL is, everyone knows what TSLA is, and they're very human readable; it's easy to quickly grasp what you're looking at.
SPEAKER_04That's true for the European and US exchanges, but some of them are also numbers. Yes. Which are the only ones I sometimes find myself not being able to read.
SPEAKER_03Yeah, so in East Asia they're often numeric, but they're still short. I don't think I've seen one more than about six characters, which, when we come onto some of the other identifiers you come across in this world, is still a very readable, human-readable format.
SPEAKER_01Yeah, absolutely.
SPEAKER_02Did they just have much longer ticker tape back then? Yeah. Okay, well, thanks for defining the ticker. Help me understand then why businesses that are looking to sell in the alternative data industry have to map to tickers. What is mapping to tickers, and why do it?
SPEAKER_01Sure. So fundamentally, it comes down to the fact that if a fund is going to use your data set to analyze securities and figure out areas where they might be able to make an above-market rate of return, then they need the ability to identify the rows in your data and how they map to securities. Take probably the most famous example of alternative data in the space: credit card transaction data. In fact, if I pull up my own credit card transaction data, I can see that I've got various transactions, some of them very obvious, right? Some of them are Uber: I know what stock that is just by reading it, I know what company that is. But if I go back further, I start to get slightly more esoteric transactions, shall we say. So I've got one here that says "the gym". Now, that could really be any gym. Is it a publicly traded gym? I know which gym it is. But if I go back even further, I think we've all had that situation.
SPEAKER_03Strange name for a pub, James.
SPEAKER_01Yeah, and that was in no way me trying to slip in that I've been going to the gym. It has yet to appear on my physique, but I have been going. So if I go back even further in my transaction history, everyone's had that situation where you go: I spent £100. £100 on what? What is this strange string of letters, and what on earth was that transaction? Now imagine you're a credit card transaction data provider. You're not just looking at, you know, James's maybe hundred transactions in a month or thousand in a year. You're looking at millions, probably billions of these things, each of which has its own string, none of which identifies a company by default. So what you would have to do is go through and say: okay, all of these ones that say Uber, including the ones that say Uber Eats, and perhaps ones that say Uber something else, those map to the Uber stock. And if you don't do that, then essentially you're leaving that work up to the fund themselves. Frankly, if you want to be selling into as many funds as possible, then you need to make your data set as easy to use as possible. And fundamentally that requires you to map your data to stock tickers.
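To make that concrete, here is a minimal sketch of the rule-based layer of such a mapping. The patterns, names, and tickers are illustrative assumptions rather than any provider's actual rules; real pipelines handle millions of far messier merchant descriptors.

```python
# Illustrative sketch: rule-based mapping of raw transaction strings to tickers.
import re

# Each rule maps a merchant-descriptor pattern to a stock ticker (hypothetical rules).
MAPPING_RULES = [
    (re.compile(r"^UBER(\s|\*|$)", re.IGNORECASE), "UBER"),  # "UBER", "UBER EATS", "UBER *TRIP"
    (re.compile(r"^AMZN|^AMAZON", re.IGNORECASE), "AMZN"),
    (re.compile(r"^NETFLIX", re.IGNORECASE), "NFLX"),
]

def map_transaction(descriptor: str) -> str | None:
    """Return the ticker for a raw transaction string, or None if unmapped."""
    for pattern, ticker in MAPPING_RULES:
        if pattern.search(descriptor):
            return ticker
    return None  # unmapped rows fall through to fuzzier, costlier stages

print(map_transaction("UBER EATS LONDON"))    # -> "UBER"
print(map_transaction("THE GYM GROUP 0042"))  # -> None (needs entity research)
```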
SPEAKER_02Yeah, and to your point: even as you look through your own transaction history, you're questioning what that £5.80 spend was on. If you extrapolate that up to millions of people, or even more, for someone who has no idea who those individuals are, it becomes really, really hard to identify computationally, right?
SPEAKER_01Yes, and the other thing is that a lot of this hinges on what the identifier in your data set is. As I said, in the transaction space, the most information you have about where that money was actually spent is this transaction string. But other data sets may have a web domain, a company name, a brand name. And with all of these, it's not immediately obvious, when looking at an internal identifier, what stock it might map to.
SPEAKER_04I think it's also that, for humans, there are a lot of things that are easily readable, and we form in our heads this idea of an entity that we're able to easily understand. You can say the word Apple and, if we're talking in a company context, differentiate it from the fruit, for instance. But with computers, you need a way of saying: right, map this one-to-one or one-to-many. And with companies you have so many different ways of identifying things that you end up needing this very unique hash, essentially, to make sense of it.
SPEAKER_02Awesome. And if we think about the investment funds, the hedge funds, that these providers would be looking to sell to: is there a barrier to entry here? Do all vendors have to map to tickers, or are there some funds that will do that on their behalf? Can you give us some insight into that?
SPEAKER_01I think it's probably fair to say that there are maybe 10, 20 very, very advanced quant funds in the world for whom there's actually a preference for the most raw data possible, essentially to have the vendor do almost no pre-processing. That's not because the pre-processing isn't useful; it's because they can do it themselves. They've built systems to map to stock tickers and to clean the data that are either autonomous or at least pseudo-autonomous. The downside of putting a data set into the marketplace as raw as possible is that it's only those 10 or 20 funds that can actually use it; fundamentally, that's your whole market. And if your data is not at a large enough scale, say thousands of tickers, then often those quant funds probably aren't that interested in it either. There are a lot of niche data sets that maybe only cover a hundred tickers, and at that point the number of funds you can target is vanishingly small.
SPEAKER_03I think the exception there would be if you had some kind of specialist data that was useful to fundamental analysts in a specific context, at which point, as a human being looking at the data and incorporating it into a classic fundamentals-first model, you can go ahead and do that ticker mapping process almost manually. So there is that side. I think it's important to make that quant-versus-fundamental distinction as to how the data is consumed. And it comes back to Simon's point: does it need to be machine readable or not?
SPEAKER_01Yeah, and to an extent I would throw that question back to you, Eric, because during your time at Neudata you must have spoken to tons of providers and tons of funds. In your experience, is mapping to tickers fundamentally the thing every vendor has to do before they open up the broader market?
SPEAKER_02I think you both touched on the correct points: there are funds out there that want the edge of having access to data that the long tail of, quote unquote, less sophisticated funds don't have the resources to assess, implement, and put into production. Maybe more from my Neudata job, what I've learned is that even though those funds are capable, you still have to join another queue, where the data needs to be prepped and ready for trial. What these funds are trying to do is have all these other teams that essentially make trialling and moving into production as easy as possible, so you're just joining other queues and your sales cycle increases. I think the other part of ticker mapping, especially as a new vendor, is that, yes, you can open more of the market, but you also end up learning a lot about your data. If anyone listening hasn't watched or listened to the Stuart episode that was released a couple of episodes before this one, he talks about doing that cross-sectional analysis of your data set and understanding where the data you're collecting accumulates around different companies or tickers, where you're strong, etc. That's really hard to do if you don't have a clear concept of what companies your data maps back to. And we talked about the fundamental use case: you can carve out three companies of data quite easily for a fundamental firm. But think about a salesperson. If you cover a thousand tickers, or five thousand, which tickers do you cover really well? If I go to a fundamental firm and figure out the five tickers they really care about, but actually my data isn't that strong there, I'm wasting a lot of time as a salesperson. So quite often the conversation is about what ticker mapping enables the end user to do, which is great, that's the purpose, but there's a lot a salesperson and a business can learn internally about their data, and a lot of analytics and derivative products they can build from it. I think that's important to know.
SPEAKER_01Yeah, I completely agree.
SPEAKER_02Okay, we touched on how important this is, but maybe to end this section: why is it so important? Do we think this is the most important step a vendor can take in their journey into the alternative data market?
SPEAKER_03I mean, after the very fundamentals of having some kind of secure delivery mechanism, then yes, I would say so.
SPEAKER_01I would agree. If we go through the pre-processing steps, it's basically the first one after, sorry, collecting the data, right? Once you have a data set and you believe it's monetizable to investment funds, it's essentially the next step. So in that respect, it's definitely the most important. After that there's obviously data cleaning, there's making sure, we've discussed it many times, that the point-in-time element of your data set is there, and there's finally setting up the delivery mechanism at the end. But I would say, yeah, ticker mapping would be at the top of the list.
SPEAKER_04Without going too far down a rabbit hole, there's also a lot of subjectivity to it, just because there are so many different ways of identifying these things. And fundamentally, you're never working with just one data set or data source. The reason the identifier is so important is that you want to be able to map across all your data sources, to combine them for better analysis or whatever else. So as part of this, you've got to take all those subjectivities into account: how are we trying to identify these companies, and how do we put that forward?
SPEAKER_02I think you've started us onto the next section, which is: how do you solve this problem? But maybe the first question to ask is why it's such a hard problem to solve. Why do vendors like yourselves exist to do this work, instead of it being done in-house?
SPEAKER_04So, for instance, when James first started talking to us about the problem, I thought this would be very straightforward: it's just one company, great, we'll find one thing to identify it and tie it all together. What you quickly find out is that there are so many different ways across markets, geographies, etc., to do this, and none of them are consistent, so every data provider essentially ends up with their own unique way of representing the entity. And the data providers sort of have to do this, because there's no ubiquitous way of doing it that's consistent from a practical standpoint. Every API you plug into will do it a slightly different way. From the end user perspective it's really irritating, but there's no better solution. So you end up with this mix-and-match of ISINs, proprietary identifiers, tickers, exchange codes, and then you get into using company names, the LEI, the CIK; there are so many different ways of doing this. Trying to combine it all in a consistent way is one source of the issue.
SPEAKER_03One thing you said there really struck me, which is that I also thought it would not be a particularly difficult task. I was reflecting on why while you were speaking, and I think it's because, coming from internet tech land, things that exist on the internet, that you pull down from the internet, you can generally go back to the internet to identify, provided you haven't lost too much of the metadata, because the information is there, and web scraping tools, and we can come on to the role of LLMs in dealing with unstructured data later, mean you can kind of go back; it's not completely two-way, but it's more two-way. Whereas some of these data sets we're dealing with, which are more proprietary, exhaust data, whatever, are not necessarily just on the internet, and the identifiers associated with them are therefore not as easily scrapable. So route one for techies, and we see this when we speak to CTOs who are being asked to do this job, is this mentality of: I can just go on the internet and find it, because it's companies, and they think Apple, Tesla, whatever, there's tons of information. But the types of data you deal with don't necessarily map back onto easily findable, scrapable information.
SPEAKER_04I think the physical version of your bank statement is a good example: a receipt, right? You'll just have an address, a company name, and then the information within the receipt. If you have a whole bunch of OCR data like that, it could be a great data set. But you've got to tie it back to what the funds care about, and then you've got to get much more creative about how you identify which companies they are.
SPEAKER_01And to understand just how complicated, or subjective, this is, it's worth zooming out. You've touched on this, but there are geographic security identifiers: you've got the CUSIP in North America, you've got the SEDOL in the UK. There are data-provider-specific identifiers: Bloomberg have the Bloomberg ID, FactSet have the FSYM ID. There are attempts at international security identification: you mentioned the ISIN, and there's also OpenFIGI. And I've been saying "security" very deliberately here: all of the identifiers I just mentioned map to a security. There is, at least in my opinion, no good identifier that is consistent and international for identifying a company, as opposed to the security. There is an attempt, you also mentioned it, with the LEI, which is an identifier for the legal entity, but there's complexity inherent in that too. Say two companies merge: maybe they end up with a new legal entity, or they keep one of their existing legal entities. But if it's a 50-50 or a 51-49 merger, what conceptually makes sense going back in time? How do you identify those companies? So that's where, zooming out, it gets really, really complicated, and then you're answering the question of: okay, how do we solve this? Which is why I think describing it as ticker mapping is probably the best way, because it hones the mind on what you're trying to do. Once you know the ticker and the exchange it's on, there you go, you've identified one thing. But this is actually a much broader problem, and everything you both said about figuring out exactly where to map a company name gets incredibly complicated.
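To illustrate that identifier zoo, here is a hedged sketch of what one record in a security master might hold. The field layout is an assumption for illustration, not a standard; the Apple identifiers shown are the commonly published ones.

```python
# Illustrative sketch: one security, many identifiers.
from dataclasses import dataclass, field

@dataclass
class SecurityRecord:
    ticker: str                  # human-readable, exchange-scoped, can change over time
    exchange: str                # e.g. a MIC code like "XNAS"
    isin: str | None = None      # international, security-level (also covers bonds)
    cusip: str | None = None     # North American identifier
    sedol: str | None = None     # UK identifier
    figi: str | None = None      # OpenFIGI
    lei: str | None = None       # legal-entity level, not security level
    aliases: set[str] = field(default_factory=set)  # names, brands, domains seen in data

apple = SecurityRecord(
    ticker="AAPL", exchange="XNAS",
    isin="US0378331005", cusip="037833100", figi="BBG000B9XRY4",
    aliases={"Apple", "Apple Inc.", "apple.com"},
)
```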
SPEAKER_04And it's not only corporate actions like M&A; there are also things like licensing deals that make it complicated. A company can say: okay, we will let a company in this other geography operate through a licensing deal. But from an investor perspective, if you want exposure to that stock, those revenue figures don't actually map back to the company whose name is on it. Following all of that through can be a nightmare as well.
SPEAKER_01We were looking at an example the other day: Domino's, right? There's Domino's in the US and there's Domino's in the UK, and they are different companies. I believe Domino's in the US owns the brand and licenses it to Domino's in the UK, but they are separate public companies. It gets even more complicated, though, because Domino's UK used to, I believe, own, was it Domino's Sweden and Norway? Scandinavia, yeah.
SPEAKER_03Yeah, but now, for some reason, it's always unique.
SPEAKER_01Yeah.
SPEAKER_03In an irritating way.
SPEAKER_01But now they've sold those. They are now, I believe, independent private companies, or one independent private company, I'm not sure. So you've got all these things that are Domino's, but they don't actually map to the same thing. One maps to a US ticker, one maps to a UK ticker, and a few don't map to any ticker: they're private. So it's incredibly complicated. And that's just one example of a very well-known company.
SPEAKER_03Yeah, and on the Domino's example, if you were building a fundamental model, you would know: you'd read the annual report for Domino's US or Domino's PLC, and you'd go, okay, they own all the Domino's branding, they license it to the various entities around the world, and they charge for it. And if you think about it, it's quite substantial: sub-5%, but it might be 3 or 4% of revenues; they pay a hefty fee. So you'd put that down as a line item, and you'd have both a mental model of how that worked and an actual model that brought the exact right amount of revenue into your fundamental model. On a quant basis, where you're looking to trade algorithmically on this information, the subjectivity starts to lie with the person doing the mapping. You could say it's not necessarily wrong, well, for most use cases it would be wrong, but it's not necessarily wrong for a machine that's just trying to do this recognition to say: I'm going to give you the Domino's US ISIN here, because there is that link. But in practice, that's not where most of the revenue exposure is. You want the UK one, or what I think is an Icelandic, I think it's private anyway, I think it's an Icelandic PE firm or something. Oh really? In Scandinavia. So there's tons of subjectivity that comes into this process.
SPEAKER_01You can even end up with situations where you go: okay, this obviously maps to this ticker. But it turns out only 30% of that company's shares are publicly traded, and it's owned by a publicly traded parent. So company B is public, only 30% traded; company A owns company B and is fully publicly traded. Which do you map to? Now, there's obviously stuff you can do there. You can look at correlations between the stock prices and the data, etc. But ultimately you're going to have to make what is a subjective decision at scale, which is going to have programmatic implications. So making sure you do that to the best of your ability is critical, and hard. Yeah.
SPEAKER_03And we can fine-tune that on a case-by-case basis.
SPEAKER_02What you're really touching on is that for a company that doesn't have any background experience in this industry, there's the actual act of doing the work, mapping data point A to company A and the steps in between. But there are also so many nuanced decisions to make that rely on the experience of having lived and breathed the roles you've had within funds, building these types of products: really important decisions that mean when you go to market you don't have to backtrack and remake them all. Which actually leads me on to my next point, which is mapping point in time. One of the things we didn't touch on earlier is that when I was at Neudata, we had a partner product that we would sell, which was ticker mapping, and one of the conversations I'd always have was about mapping the data historically. I don't think most people coming into this industry realize that historic point-in-time mapping, as you think about going into trial, is really important. So maybe you can touch on why that historic point-in-time mapping is important, but also why it's difficult to do.
SPEAKER_01Yeah. The first thing I want to do, since we've talked about point in time in almost every episode so far, is make clear to listeners that there's a distinction between the high-level point in time we've discussed in the past, which is looking at the date stamps within the data set and making sure that was the information you had at the time, and point-in-time mapping in a ticker context. What we mean here is that tickers change over time. Take a famous example: Meta used to have FB as its ticker, and then, I want to say in June 2022, or maybe 2020, FB became META. The reason you need to know this is that, going back in time, you need to know what the identifier was on the day you were backtesting that data. Think about what a fund is going to do with your data set to evaluate it: they're going to backtest it on a historical sample. They're going to say, okay, suppose we'd had this data set in 2015; we're going to run a strategy from, say, 2015 to 2018 using that data and see if we can outperform the market. That only works if the identifiers you have are true as of the historical period in which they're testing. This is particularly pertinent when there have been corporate actions. If there's been a corporate action in 2020 that meant company A was subsumed by company B, you can't be running a backtesting strategy in 2017 that has pre-knowledge, via your ticker as of today, that that acquisition is going to happen, because you're essentially letting information bleed in. That's one aspect, but there are technical aspects as well. It's a critical point, and to make it very simple: what you need is a history of the ticker throughout the company's life cycle, from, ideally, when it IPO'd, to perhaps when it was acquired and the ticker changed, to perhaps when the company, à la Meta, decided to rebrand because it now owns loads of companies and the original name doesn't make sense, so it gets another ticker. And you need that mapped to the dates on which those changes occurred, because otherwise you're going to have problems when running a backtesting strategy.
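A minimal sketch of what point-in-time resolution can look like, using the FB-to-META change (effective 9 June 2022, as confirmed later in the episode) as the worked example. The table layout and function are illustrative assumptions, not any particular vendor's schema.

```python
# Illustrative sketch: resolving the ticker that was live on a given date.
from datetime import date

# Half-open intervals: a row is valid from `start` (inclusive) to `end` (exclusive).
TICKER_HISTORY = {
    "meta-platforms": [
        ("FB",   date(2012, 5, 18), date(2022, 6, 9)),  # listed under FB at IPO
        ("META", date(2022, 6, 9),  None),              # open-ended current ticker
    ],
}

def ticker_as_of(entity_id: str, as_of: date) -> str | None:
    """Return the ticker live on `as_of`, so a backtest never sees an
    identifier before it existed (no information bleed)."""
    for ticker, start, end in TICKER_HISTORY.get(entity_id, []):
        if start <= as_of and (end is None or as_of < end):
            return ticker
    return None

assert ticker_as_of("meta-platforms", date(2015, 1, 2)) == "FB"
assert ticker_as_of("meta-platforms", date(2023, 1, 3)) == "META"
```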
SPEAKER_03Absolutely. And I just looked in our knowledge base, and you were absolutely bang on: it was June 2022. So, well done. I would have thought it was 2020, actually. No, June 2022. James is the knowledge base.
SPEAKER_01That's that's manual QC for you.
SPEAKER_03We should come on to manual QC, but when you bring up Meta, it's an interesting example, and just to briefly return to the identifier topic: the ticker changed, and if you're doing a backtest and pulling your historicals using the ticker, you need to know about that ticker change. But the ISIN didn't change; the core security, the ISIN, stayed the same. These are the kinds of things you need to capture when you're performing these types of mapping exercises.
SPEAKER_01There are also times when the identifier, the ISIN, the FIGI, changes, but the ticker doesn't. Often. Yeah. So it's just worth bearing in mind. It's a rabbit hole for sure.
SPEAKER_02Well, before we started the episode, you were saying that Apple has many, many ISINs, which adds to the complexity: a company doesn't necessarily have just one identifier.
SPEAKER_01Yeah, it's important to point out that while a ticker refers to a stock, some of these identifiers, you mentioned the ISIN, can be assigned to things like bonds, etc. So again, it comes down to the bonds of the same company as well as the equity. Yeah, correct.
SPEAKER_02Manual QC. I think before we get on to that, let's talk about LLMs, because before we get to the QC element, there is in your work an element of using today's technology to make this exercise easier. I'd love to understand how the rise of LLMs and AI has made this easier.
SPEAKER_04They've been super instrumental on a variety of fronts, and we leverage them in different ways, essentially because of the zoo of models out there at different price points; they can be better at different things. For example, you can use relatively cheap models for easy classification, in the sense that you already have a data set of information but you're not sure, because of, as we've said, all the different ways of identifying a company. So when we're doing search, and I'm calling it search as in web search across all sources: take the example where you have a company name as the input, and maybe a web domain. From that, you'll go scrape the company's site, Wikipedia, the SEC if it's in the US, all of these different things. And the results you get, because of the identifier and ticker problems we've mentioned, may not all refer to the same company. The LLMs are super useful in figuring out which one you're talking about, but critically you need that input information, because the research side of things, at least for most of the models, is still lacking, because of the nuances we've been speaking about. Some of the more expensive models get better at it, but then it comes down to the cost problem: the top-of-the-line models will be slower and quite expensive to use. So depending on the size of your data set, you can't just, for reasons of time and money, run them across everything.
SPEAKER_03I think for a smaller firm especially, but even a bigger firm, just blasting the whole data set with a Claude 4.6 or a ChatGPT 5.2 Pro would give you workable results, eventually, and at a significant cost. That would probably be a fair summation.
SPEAKER_01Yeah, absolutely. And it's probably worth highlighting that this is why we do it the way we do it, which is to say: always use the simplest approach that produces an accurate output. Initially in our pipeline, the large language models, ChatGPT or Claude, are not involved. That's because they're expensive, but also because they don't need to be: there are circumstances where the answer, though perhaps not obvious, is very much answerable in a traditional, programmatic way. We have ways of generating confidence scores, and we say: provided the confidence score is high enough, we're happy with that conclusion. Only after a row has been passed through multiple processes do we say: okay, we've got loads of information, but the confidence of our deterministic system is not above the threshold we require. So we take everything we already know, because it's been through so many steps, and provide it to an agentic model to make the more subjective decision. And that's usually why the model is brought in: because there's a subjective aspect. That's still not the end, because, and again we have ways of getting these models to generate confidence scores for us, we do the same thing: if none of the programmatic stuff, deterministic or non-deterministic, has worked, it goes for manual QC. That has multiple benefits. Number one, it allows us to get it right, and you hope manual QC is 100% accurate. I certainly like to hope so.
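A hedged sketch of that tiered flow: cheap deterministic pass first, a model only when confidence falls short, manual QC as the backstop. Every name, body, and threshold below is an illustrative assumption, not Valsys's actual pipeline.

```python
# Illustrative sketch: escalate from deterministic matching to an LLM to manual QC.
CONFIDENCE_THRESHOLD = 0.95
KNOWLEDGE_BASE = {"uber": "UBER", "uber eats": "UBER"}  # toy alias -> ticker map
MANUAL_QC_QUEUE: list[str] = []

def deterministic_match(name: str) -> tuple[str | None, float]:
    """Exact alias lookup against the knowledge base; no model involved."""
    ticker = KNOWLEDGE_BASE.get(name.strip().lower())
    return ticker, (1.0 if ticker else 0.0)

def llm_match(name: str, evidence: dict) -> tuple[str | None, float]:
    """Placeholder for the agentic step: a real system would hand the model all
    gathered evidence and ask it to pick a candidate and self-report confidence."""
    return None, 0.0

def resolve(name: str) -> str | None:
    ticker, conf = deterministic_match(name)
    if ticker and conf >= CONFIDENCE_THRESHOLD:
        return ticker                                         # cheapest path wins
    ticker, conf = llm_match(name, {"deterministic_candidate": ticker})
    if ticker and conf >= CONFIDENCE_THRESHOLD:
        return ticker                                         # model settled the subjective call
    MANUAL_QC_QUEUE.append(name)                              # humans settle what machines cannot
    return None

resolve("Uber Eats")     # -> "UBER" via the deterministic tier
resolve("THE GYM 0042")  # -> None, lands in MANUAL_QC_QUEUE
```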
SPEAKER_02The self-proclaimed 100%.
SPEAKER_01Self-proclaimed. Well, you know, June 2022. It's in the knowledge base. The other thing it allows you to do, and you mentioned this earlier, Eric, is to learn more and more about your data set, and more and more about the histories of all these companies, which enables you to make more and more nuanced decisions and make sure those are reflected across your knowledge base. And Simon, well, generally it's not Simon doing the manual QC, he's far too important, but we did once make Simon do, I don't know, what was it? A set of probably a few hundred rows, maybe a thousand. And I remember you saying afterwards that it was actually really useful, because you could now see where the system you'd designed and built had gaps. And I think afterwards you changed some aspects of the pipeline, right?
SPEAKER_04Yeah, massively, because from my perspective I'm usually looking at it as just rows of scores, essentially. For this part of the pipeline, what I'm trying to do is, say, get the right ISIN. So you run that on a subset of whatever data source, you see what percentage of it hits, and you iterate on that. But when you're going through the manual QC, you're doing it all at once, in a way I hadn't done before. So it was really instrumental in pattern matching, not only where the errors show up, but how you as a human go and solve them, because then you can try to replicate those flows programmatically.
SPEAKER_02We've spoken about how you, or any company, would do this today. If we go back 10 or 15 years, the alternative data industry already existed, and the mapping exercise was still important. Was it all outsourced QC? Was it manual? How was the work done?
SPEAKER_01Well, we weren't around 15, 20 years ago, or at least not in the alternative data space. But my understanding is that it was basically the pipeline we have now with all the bits that use machine learning or AI agents chopped out, because that technology was either very nascent or didn't exist. So it would be a combination of manual work, deterministic algorithms, and then some fuzzy matching. There are a bunch of algorithms that have been around for decades that deal with string similarity, and you can make a really quite effective system using just those techniques.
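For a flavour of those pre-LLM techniques, here is a minimal fuzzy-matching sketch using only the Python standard library; the candidate list and threshold are assumptions. Real systems layer several similarity measures (Levenshtein, Jaro-Winkler, token ratios) with carefully tuned thresholds.

```python
# Illustrative sketch: fuzzy name matching against a small candidate list.
from difflib import SequenceMatcher

CANDIDATES = {
    "Apple Inc.": "AAPL",
    "Alphabet Inc.": "GOOGL",
    "Applied Materials, Inc.": "AMAT",
}

def best_fuzzy_match(name: str, threshold: float = 0.8) -> str | None:
    """Return the ticker of the most similar candidate name above `threshold`."""
    scored = [
        (SequenceMatcher(None, name.lower(), cand.lower()).ratio(), ticker)
        for cand, ticker in CANDIDATES.items()
    ]
    score, ticker = max(scored)  # highest similarity wins
    return ticker if score >= threshold else None

print(best_fuzzy_match("APPLE INC"))      # -> "AAPL"
print(best_fuzzy_match("The Gym Group"))  # -> None (too dissimilar to everything)
```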
SPEAKER_03I think you can't really overstate that the quant pioneers in this world, the big funds, the ones we were talking about earlier who don't need things ticker mapped because they'll do it in-house, when they were first coming onto the scene, back in the day, prior to this technology, relied on a lot of manual labor, outsourced labor. It's known as knowledge process outsourcing or business process outsourcing: they would have large numbers of people who would go into all the filings and review them. On the fundamental side they'd be combing through the SEC filings, they'd be looking at private market filings all around the world, and they would just be doing that job. There was a huge manual effort that went into cleaning and preparing the data for those large funds early on. And we know that directly from the horse's mouth.
SPEAKER_01Well, I think a lot of that manual work was occurring at places like Bloomberg, right? A little anecdote I was told by, I think, a fundamental analyst: back in the day, 10 or 20 years ago, say, if you could find an error in the Bloomberg terminal, Bloomberg would give you, I want to say, something like $50 or $100. So it was often a game the analysts would play: let's pick a company we know incredibly well that's maybe slightly niche, go and find an error in the Bloomberg data, alert them to it, and then lunch is on them. And by error in their data, you mean in the mapping? No, it could be anything. In this example, what I'm really talking about is something Jack touched on: getting fundamental financial data out of an SEC filing, say a company's revenue or its operating income. The reason I'm telling the story is not because it directly references ticker mapping, but because it goes to show that even a company like Bloomberg, which is definitely the gold standard, let's be honest, of financial data, was, until I think actually not that long ago, using manual QC, not even done by them, but done by Wall Street analysts, and paying them $100 for a data point. That's pretty punchy. It demonstrates that even though they've probably got really advanced systems for combing through these filings and getting the data out, even they knew there's always subjectivity here. And the degree to which you can get an expert to actually look at it and make a human judgment call about the data, I think that's the gold standard in my mind.
SPEAKER_03And on the fund side, that was true as well, just to bring it back around to alternative data. Especially early on, not everyone was using some of these data sets; the market was nascent. If you could get a data set tickerized and start trading on it, there was massive alpha to be had, with data sets that are perhaps now de rigueur, everybody's using them. So it was worth the investment of hiring a bunch of people to go through and effectively do what we do in terms of manual reconciliation, but for the entire thing.
SPEAKER_04You said 15, 20 years ago; I think the biggest difference over time for deterministic versus non-deterministic processes comes down to formatting, essentially: how consistently readable is this for a computer? You have certain data sets, like the SEC's, which are given in a format a computer can read, which is XBRL, and then you have others that are just going to be in a PDF, and until quite recently you didn't really have a way to deal with those. OCR has gotten better, but only recently; definitely 15 years ago you could not use it to get it all out. So I think it's gone less and less manual, but I'm constantly surprised at how much of it is still manual, or at least manually checked.
SPEAKER_03Yeah. And that era of web scraping, when I first started programming at least, it was still extremely brittle, and people who could do it well were in high demand. It was very, very technically challenging, just because of the unstructured nature of the data you were dealing with.
SPEAKER_02We've talked about hiring out armies of people to do this manually, and we've spoken about getting high-cost LLMs to do it. What is the cost of doing this in-house, versus outsourcing it to someone like Valsys?
SPEAKER_01I think the cost of doing it in-house is probably best measured by the amount of time it's going to take for one or more of your salaried employees, who probably have no experience of ticker mapping at scale, to first learn what it is and how to do it, and then to implement it. We've spoken to providers who said the internal estimate for what this was going to cost them was twenty-five thousand, fifty thousand, a hundred thousand, even higher, because they reckoned they'd need three employees working on it for six months to get it right. So the internal cost can be significant. I'm not going to give a price list here, but suffice it to say that, depending on your data set, we've generally found we can come in at a much more cost-effective price relative to doing it internally. And we'd certainly like to say that we believe we do it a lot more effectively, too.
SPEAKER_04I think it also comes down to the actual size of your data set, in the sense of the number of tickers or securities you're dealing with, and, because of the corporate actions piece, how many subcategories that breaks down into. You can have brand information that ties back to different securities over time, but if your total, let's call it, investable universe is small enough, there's a point at which, if you have a very specific data set on 10 tickers, it should be fairly easy to do in-house. But the more you cover, and the more you're breaking down into specifics, the exponentially harder it gets.
SPEAKER_03There's one other point to touch on here, which I mentioned earlier: the knowledge base. If we reference the Domino's example, that exists within the corpus of knowledge we've built up.
SPEAKER_01And sadly our brains.
SPEAKER_03And also our brains. So that exists within that corpus. So the first thing we do in the pipeline when we see an entity, and this is a very different matching exercise, is ask: does this match anything already in our knowledge base? If we see a receipt and it's a Domino's, and we have the geography of that receipt, we're job done, in a way that may not be the case for somebody attempting it themselves, whether internally or even at another vendor.
SPEAKER_01Well, that's why we do the manual QC we do. I mean, I have said to Jack at points, you know, there must be easier ways to make money, when we've been reconciling a thousand-plus rows. But the benefit of doing it is that we do this professionally, right? We're not doing a one-and-done ticker mapping exercise; we're going to do ticker mapping exercises for multiple vendors, so building that very precise, very carefully quality-controlled knowledge base has benefits to us over many years and many contracts. Whereas if you're an individual vendor, it is, to a degree, a one-and-done exercise. Yes, you have to keep the mapping up to date, but there's not necessarily the same benefit to you. Put it this way: say you're 80 or 90% accurate. The incremental benefit to you of pushing that extra 1 or 2% is low, relative to the benefit to us of making our knowledge base even 0.1% more accurate. So that's definitely one angle of the benefit.
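A toy version of that knowledge-base-first lookup, using the Domino's example from earlier: the same brand resolves differently by geography. The entries and structure are simplified assumptions for illustration (DPZ is the US listing, DOM the UK one).

```python
# Illustrative sketch: the knowledge base answers (brand, geography) before
# any further research is attempted.
PRIVATE = "PRIVATE"  # known entity with no public ticker; distinct from "never seen"

KNOWLEDGE_BASE = {
    ("domino's", "US"): "DPZ",    # Domino's Pizza, Inc., which owns the brand
    ("domino's", "GB"): "DOM",    # Domino's Pizza Group plc, the UK licensee
    ("domino's", "NO"): PRIVATE,  # sold off, now privately held
}

def lookup(brand: str, country: str) -> str | None:
    """First pipeline step: have we already resolved this (brand, geography) pair?"""
    return KNOWLEDGE_BASE.get((brand.strip().lower(), country.upper()))

print(lookup("Domino's", "gb"))  # -> "DOM": job done, no further research needed
print(lookup("Domino's", "DE"))  # -> None: falls through to the full pipeline
```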
SPEAKER_02Does this evolve then at some point to being a plug and play slash licensing the knowledge base out, or does it always have to be a service?
SPEAKER_01I think that's a question for the future, if I'm honest. What I would say initially is that, in our experience, and this is almost a cliché, every data set is different, all unique and wonderful in their own way. The critical thing is that, in order to get the most out of the knowledge base, we calibrate our pipeline to your specific data set. So we've got these two things: there's the knowledge base, and there's the pipeline. The knowledge base is what it is, and the pipeline is what exploits it. At least for now, the best thing is for vendors to come to us and provide us with their data set, and then we can make sure we squeeze as much juice as possible out of the knowledge base by calibrating the pipeline for their data specifically. A really simple example of what I mean by every data set being different is that point about what your internal identifier is. Is it a domain? Is it a transaction string? Now, if two different credit card transaction data providers came to us, the versions of the pipeline we'd use would likely look very similar. But perhaps not, or perhaps there would be some significant differences, if they're in different geographies: somebody with an English-language, UK-based transaction data set may require different aspects of the pipeline compared to a Spanish-language transaction data set, for example.
SPEAKER_04I do think there's an interesting angle there, in the sense that the more data sets we work with, the more we grow out the knowledge base, and that can be done across entities: companies, people, and so on. There are things that can be really useful there, for example in banking, KYC and KYB type use cases, where having a graph of who owns what and how those things are linked is really useful. But that's not current; right now, I think the most valuable thing we can do is continue building out that knowledge base.
SPEAKER_02Understood. Conscious of time, maybe to tie that last bit up in a bow: why should a vendor listening use Valsys for this type of service?
SPEAKER_03Yeah, I can take that. The first thing I'd say is that, as employees, mind, we've seen the inside of large funds, we've seen the integration of alternative data sets, and in fact we've done the integration of alternative data sets into these organizations. So we have a really good understanding of what the end user is after in terms of data shape, delivery, the whole thing. And delivery is something we actually haven't touched on, but getting the timing right of when your data is updated is really important.
SPEAKER_01Yeah, that's a fair point, actually. We have, hopefully, successfully advertised our ticker mapping services, but we do a lot of other stuff. We take data sets from essentially raw all the way through the transformation pipeline to make them investment ready. That could be everything from ticker mapping to an exercise we call data profiling. You touched on this somewhat earlier, Eric: once you've got the data mapped to tickers, where's your coverage geographically, and on a sector basis? How has that evolved over time? Generally, when we see a data set come in, if you look at the coverage throughout the point-in-time history, the volume has increased over time, so your areas of coverage may have changed over time. What we like to do with the data profiling exercise is provide a data provider with a report, or, and this would be, at least in our opinion, the best way to do it, a Jupyter notebook, particularly for a quant fund, that can be run and can literally demonstrate every aspect of a data set, essentially up to the point where a fund would choose to go away and run a backtest on it. And I think Stuart mentioned this, actually: we don't think running your own internal paper-portfolio backtest and releasing a white paper is necessarily the right way to go. We think it's much better to do all the heavy lifting beforehand. And then finally, data delivery. So we can do everything. Sorry, I went into more detail there than I meant to; I love data profiling.
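As a sketch of the coverage side of that profiling exercise, assuming a mapped data set with date, ticker, and sector columns (the column names are assumptions for illustration):

```python
# Illustrative sketch: distinct mapped tickers per sector per year, showing how
# coverage has shifted throughout the point-in-time history.
import pandas as pd

def coverage_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Count distinct mapped tickers per (year, sector); sectors become columns."""
    df = df.dropna(subset=["ticker"])            # keep mapped rows only
    df["year"] = pd.to_datetime(df["date"]).dt.year
    return (
        df.groupby(["year", "sector"])["ticker"]
          .nunique()
          .unstack(fill_value=0)                 # one column per sector
    )
```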
SPEAKER_03Exactly. And finally delivery, which, when we've been internal on the fund side, can cause a whole range of difficult problems, which we should perhaps talk about on a future episode.
SPEAKER_04I think, fundamentally, this is all we do, right? We've seen more data sets than anyone doing it in-house, and we've built up a corpus, which we've called the knowledge base, essentially, that makes our mapping more complete and more correct than anyone out there. So, agreed.
SPEAKER_02That's a good way to end it. Well, thanks guys, it's been awesome to have you come out from behind the curtain to join the episode. I hope everyone enjoyed listening.