
Webinar – We built a new cloud and it got scarily out of hand, Vol. 1

Video only available in Czech

We gave the cloud project the beautiful name CLOUDPOINT 2022 – after our main service – and the year in the name alone tells you that things didn't quite go to plan, given that we only finished it recently.
Transcript

So what worked and what didn't? What worked is clear: we built it. What didn't work were the things we ran into along the way – things that can happen to you on your own projects. So we wanted to share what happened to us, what to watch out for and what to avoid, so that you don't fall into the same trap and have to deal with similar problems. And if nothing else, other people's misfortune is always entertaining, so I suppose it will at least be fun for the mischievous.

To explain what the project was about and the problems we ran into, I should say a few words – not so much about the company, but about why we did the project in the first place and what it involved.

The Cloud Provider sells computing power as a service. We have some history behind us, but as a company we have been on the market since 2020. We used to provide our services on leased infrastructure and leased solutions. Over time we reached the point where the numbers showed it would be more profitable to build and run the infrastructure ourselves, so we put together this project, whose goal was to build a cloud infrastructure for providing virtual computing power on our own hardware.

Originally, we expected the project to expand the resources we had leased from our provider and add value in the form of newer technologies, higher capacity, faster speeds and so on. It was also more profitable for us than having the then provider prepare it for us.

That is why we started doing it. The project began to take shape at the end of 2021 and we planned it beautifully – on paper it always works out. We wanted to start in 2022 and be done in November 2022. It did not work out.

Of the planned eleven months – we already had experience with this type of project and knew what to expect, so eleven months should have been fine with a margin – it did not work out. In the end the project took 16 months and we essentially finished it in March 2023.

Some of the problems that arose were predictable, most were not. Among the predictable ones that still surprised us a bit, one of the biggest was the unavailability of hardware. We all know what 2022 was like: no processors, no disks, no memory. We knew we were starting a hardware acquisition project at a time that was not exactly suited to it. On the other hand, we were building the infrastructure with whatever was available and leaving ourselves a lot of freedom in which technology to use.

Our company tries to avoid any kind of vendor lock-in, whether in hardware or software, so we ran a fairly competitive bidding process, approached all the possible vendors and manufacturers, and essentially left it up to them to offer us something they were able to deliver and that could actually be found on the market within a reasonable time.

Where it didn't quite work – it's not that we couldn't have bought anything else, but our technicians, especially the network guys, are in love with a certain technology; they prefer it and in principle there is no reason to drop it. So networking was a bit more complicated, because we didn't want just any devices there: there was some freedom, of course, but we wanted a specific vendor we are used to. And that is exactly where it bit us – several times. Right at the start, while we were still working out the deadlines, after we had chosen which devices they would be, the supplier informed us that the lead time was a mere 340 days.

We survived that too and a solution was found. The common thread running through our problems and their solutions is that we fortunately chose good partners – suppliers and manufacturers' representatives – who were able to help us and find a way to keep the project moving forward.

To be specific: before we get to the technical problems, I would like to mention the project-level ones, the more fundamental ones, because they had an impact on the overall solution.

When we were designing the infrastructure, we based the technical solution on our experience from previous operations and, of course, projected modern technologies and trends into it. But overall we expected that we were expanding the capacity of the existing solution and essentially adding our own resources to it.

The first big change came during implementation, when management concluded that it would be better if the new hardware stack was not built as part of the existing solution but as an independent one, with no ties to the original cloud. That way we would be able to offer redundant solutions in terms of backups, load distribution and so on. Of course it was the business people who came up with this; the technicians would never have thought of it. And it actually wasn't a bad idea.

For us, it meant redesigning the hardware stack, especially the networking. The network guys added a lot of new toys, everything was recalculated, the lead times did not get any shorter – we just had to look in other countries, even outside Europe – but it was solvable and acceptable both technically and from the management and business side.

Worse, the further the project progressed, the worse the communication with the existing provider of our services became. They did not like what we were doing at all, and the end result was that management decided we would not do it in the existing location but entirely on our own, in a completely different place. Which one? "Pick one" was the answer. Put it wherever you want, just make it happen. To give you an idea, this decision came in October, a month before the original launch date.

We bought the hardware through tenders and worked through some delivery problems, which for the servers, for example, were not that significant – we slipped by about 14 days. The network infrastructure was more complicated, but there are ways to solve it, to build it and to have the equipment available. And just as we had all the hardware together in mid-October and the colleagues were starting to plan how they would tackle it in their lunch breaks, the decision came: yes, go for it, but not in this location – find another one.

So another data center had to be found – you can probably imagine what that entails. We had 70 nodes that needed to go into some data center; we needed power, connectivity, public IP addresses, which we had not had until then, and we had to sort out the LIR. And all of it fast enough that the guys could start assembling and racking the hardware somewhere.

Those were the biggest problems and worries for me as the project manager, since I was responsible for ensuring the conditions for implementation. So that was the first batch – it all worked out in the end, but these were the complications that came with the project.

I knew from the start there would be complications; no project ever follows the planned schedule, and if you stay on schedule for even a month, it's a miracle – especially on one this long. So I expected that. But that the brief would change twice, and in a way that meant rebuilding the whole thing – that I had not planned for at all. Still, it got resolved.

A similar problem arose on the technical side, though. Unfortunately, it's always the way with technicians: if you give them enough time to think, they will come up with something new, better, faster and so on (modern technology does work better, after all). And eleven months is an awfully long time for that.

So, in the time it took to set the budget, select the hardware supplier, select the infrastructure and wait for it to arrive, the engineers decided it might not be a bad idea to modify the hardware for the storage solution.

We had a storage solution built on software-defined storage, on Ceph. The original idea was that one group of servers would hold the tier-1 disks and a second group the tier-2 disks, each tier would run its own separate Ceph cluster, and they would be connected by one redundant 100 Gbps link. A perfectly sufficient solution.

Unfortunately, before any of it was delivered and installed, it was decided that it might be better to have two redundant links, so 200 Gbps. As the designer you have no argument against that: it will be faster, safer, more stable and it costs less. Fine. That was the minor problem.

The bigger problem was that we let them think about it for a really long time, and at that point it occurred to them that the disk-array tiering could be solved not with separate servers per tier but by splitting the tiers within each server and stretching one Ceph cluster across all of them. Throughput would go up, latencies would go down. And while they were at it, they could squeeze in a new toy they hadn't had the chance to play with yet: an NVMe array facing the customer.

NVMe disks were already planned in our solution, but only as a cache and buffer array. Now the idea was that if it were rebuilt this way, the infrastructure could be upgraded with a third tier – the classic HDD and SSD, plus a super-fast NVMe array offered to customers.
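
Just to illustrate the idea – the talk doesn't show the actual configuration, so the rule and pool names below are made up – this is roughly how one mixed Ceph cluster can keep HDD, SSD and NVMe tiers apart using device classes and per-class CRUSH rules:

```python
# Hypothetical sketch, not the production config: one Ceph cluster across all
# storage servers, with HDD / SSD / NVMe tiers separated by device class.
import subprocess

def ceph(*args):
    cmd = ["ceph", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# One replicated CRUSH rule per device class, failure domain = host,
# so a pool only ever lands on OSDs of the matching tier.
for devclass in ("hdd", "ssd", "nvme"):
    ceph("osd", "crush", "rule", "create-replicated",
         f"rule-{devclass}", "default", "host", devclass)

# Example: a customer-facing pool pinned to the NVMe tier (name is made up).
ceph("osd", "pool", "create", "vm-nvme", "128")
ceph("osd", "pool", "set", "vm-nvme", "crush_rule", "rule-nvme")
```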

What can I say. I didn't mind – once someone else decided to release the budget for it, I had no objection. There's nothing wrong with the idea itself. The fact that it wrecks your schedule, that the hardware has to be reworked, that you have to get different components from the supplier and, best of all, pressure him to deliver them on the same dates as the rest so nothing sits around idle – the technicians don't care, that's the designer's problem. It will simply be better, bigger, faster. And to make sure of it, they conspired with the sales people: we'll have something the others don't have, so we can sell something the others don't have or that is very expensive elsewhere. So I clicked my heels and off I went to find it.

The fact that the design changes is a problem for the designer; technically it's not such a big deal. The problem for the technicians is that these changes usually aren't thought through in detail for their impact on all the hardware. The moment the new technical solution arrived and every server was fitted with a controller for the ordinary disks and the operating system, special controllers for NVMe, and the HDDs, SSDs and NVMe drives themselves, every PCIe slot was stuffed – and the originally planned Intel 4314s somehow couldn't cope with driving all those disks; they simply couldn't keep up.

The two-socket solution – I don't know how well you know it – uses the Silver 4314: 16 cores with HT, 2.4 GHz and 64 PCIe lanes per socket; in a dual socket, times two, 128 lanes. And the reason I mention the controllers: compared with the original design, where nobody worried much about PCIe lanes, this great, amazing, fast build meant two 100 Gbps network cards that each wanted a PCIe x16, an ordinary controller, for which an x8 would have been enough, and two special NVMe controllers with two and four ports – so three controllers in total sitting in x16 slots. All of a sudden the final box needed about 80 PCIe lanes. With 64 per socket and two sockets, 128, it still adds up on paper. Unfortunately – and I say this as an Intel supporter; in my opinion AMD doesn't quite belong in the server world, even if someone spits on me for it, that's just my experience – those Intels couldn't handle it across two sockets. We found this out while burning in the servers during basic tests, before our own platform was even installed. And now what?
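
A quick back-of-the-envelope check of that lane budget – the card list below is only illustrative, pieced together from the numbers mentioned above – shows why it works out on paper for both CPU options, and also that the dual-socket box can only do it by splitting the cards across the two processors:

```python
# Illustrative PCIe lane budget based on the rough numbers from the talk.
CARDS = {
    "100G NIC #1":        16,
    "100G NIC #2":        16,
    "HDD/SSD controller": 16,  # an x8 card, counted here as taking an x16 slot
    "NVMe controller #1": 16,
    "NVMe controller #2": 16,
}

CPU_OPTIONS = {
    "2x Xeon Silver 4314": 2 * 64,  # 64 PCIe 4.0 lanes per socket
    "1x EPYC 7543":        128,     # 128 PCIe 4.0 lanes in a single socket
}

needed = sum(CARDS.values())
print(f"lanes needed: {needed}")  # 80
for cpu, lanes in CPU_OPTIONS.items():
    verdict = "fits" if needed <= lanes else "does not fit"
    print(f"{cpu}: {lanes} lanes -> {verdict}")

# The catch: 80 > 64, so in the dual-socket box the cards have to be split
# across both CPUs, while the single-socket EPYC keeps them all on one CPU.
```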

Our suppliers and partners helped us a lot here: working with them, we quickly arrived at the option of swapping everything out and rebuilding, and we ended up with a technical solution where the two Intel sockets were replaced by a single AMD socket. As a single socket, the EPYC 7543 gave us the same number of cores – just at a higher frequency, 2.8 GHz – and the same number of PCIe lanes, but in a one-CPU solution. In testing, this showed no problems on the PCIe lanes at all.

So for all the storage servers we suddenly had to replace the CPU, the board, the chassis and, unfortunately, also the memory, because going from two sockets to one meant going from the original 32 GB modules to 64 GB ones. Effectively brand-new servers.

I have a non-technical question about this, the kind technicians don't like – how much did it cost?

You mean in time, nerves, hair, sleep and so on? As for the money, we chose a great partner for this, and in the end the conversion stayed within the same budget as the original. I admit I don't entirely know how they did it, but apart from a small difference in the order of a few thousand, we got those servers for the original money.

Will you tell us who this miracle partner is?

Well, we didn't want to turn this into marketing, but of course we mention it everywhere. In the tender, when we asked about every possible technology, we ended up choosing Supermicro server technology from the offers, and our supplier was Abacus. The guys there were very helpful – especially the technicians, who helped us at the time by testing different hardware configurations from the components that were actually available. The biggest problem was that you could always come up with yet another solution, but the components had to be obtainable so we wouldn't have to wait another four months.

So the guys tried different configurations and found the one it would work on – and built it from components they were able to get at the time. The price, of course, is no longer up to the technicians but to the company's management, and they met us halfway as well.

It's true we have worked with them on hardware for a long time, so we have good experience with them; still, this was a project that other vendors fought for too, which we don't mind at all. I think it's another argument for the view that a project is not only about money but also about what the partner can offer you beyond a nice price. Not that their price wasn't the best – if it hadn't been, it wouldn't have gotten past management anyway; they were good in that respect too. But the added value is not just a good price, which the big vendors can also offer; it's the approach and the cooperation when something goes wrong. That was a first for us, and I think the cooperation worked perfectly. So we can also thank Abacus that it came together at all.

The next chapter is much more technical and landed on my head a little less – although that's not entirely true either, because in the end everything landed on my head. I just didn't have to be as deeply involved.

The problem was at the network layer, so I have a network guy here to say what he didn't like. Of course it fell on my head too, but they got what they wanted. So – what did you get, and what did you not really want?

We got modern technology – how else, in a new project – and we bought some nice optical breakout cables. Everyone knows how optics work: one cable on one side, another on the other, I run it from one port to the other and everything works fine. After my colleagues mounted all the servers in the racks, my colleague Honza and I came in and started connecting the cables. Of course we hadn't looked at them in the office – it's a seven-metre run, I plug it in and I have a link, makes sense. So we arrived at the data center, grabbed the breakout cables and started wiring. And only then did the ugly surprise come: we discovered that only the last 50 or 30 centimetres fan out into separate ends. That is a problem when you need two ports plugged into one rack and two ports into the other rack – 50 cm just doesn't reach.

So we had to improvise on the fly, because we got what we wanted – the latest and greatest – so there it was, and without admitting there had been a mistake, we had to get it plugged in. The cables were expensive, custom-made in Germany, rushed because they weren't in stock. I don't need to comment on the rest of the project fallout. They got what they wanted; I got it for them.

So we started plugging everything in. We have the breakout cables, and my colleague tells me to go by the documentation while he connects them. We clicked the optical modules into all the bays, everything flashed with the firmware we needed, click click, and together with my colleague, who was installing the modules into the bays, we went through all the servers from top to bottom. But we have two kinds of chassis.

There is the Intel chassis, where the network cards – as we now know – sit next to each other, and the modules installed in them are, as you can see in the picture, upside down. Unfortunately, when we installed the modules we didn't notice this peculiarity, so we installed them all.

They didn't put them in backwards, they put them in correctly – yes, that's how they're supposed to be; we tried them the other way round and it didn't work. But we didn't find it odd at all: click click, turn, click click. Then we moved on to the AMD boxes, where the network cards sit on top of each other and the modules go in exactly the same way in every slot. We didn't think that was odd either; we just clicked them in from top to bottom.

Then began roughly 6 or 8 hours of work with the optical breakouts, figuring out how to connect two racks fully redundantly with the number of breakout cables we had, so that we wouldn't have to re-plug anything once in production.

The redundancy was on both the switches and the network cards, and each card carried two more functions, so everything had to be spread out so that if one side dies, the other keeps working. If a switch dies, it works; if a cable stops working, no problem. We had everything connected on both sides, in order.
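
In hindsight, even a tiny sanity check of the cabling plan would have caught a lot of this before anyone touched a breakout cable. A hypothetical sketch – server names, switch names and port numbers are made up – that verifies each server's two uplinks really do land on two different switches:

```python
# Hypothetical cabling-plan check: every server has two uplinks and they must
# land on two *different* switches for the redundancy to mean anything.
from collections import defaultdict

# (server, nic_port) -> (switch, switch_port); all names are made up.
PLAN = {
    ("node01", 0): ("sw-a", 1),
    ("node01", 1): ("sw-b", 1),
    ("node02", 0): ("sw-a", 2),
    ("node02", 1): ("sw-a", 3),  # deliberate mistake: both uplinks on sw-a
}

uplinks = defaultdict(list)
for (server, nic_port), (switch, sw_port) in PLAN.items():
    uplinks[server].append(switch)

for server, switches in sorted(uplinks.items()):
    if len(set(switches)) < 2:
        print(f"{server}: NOT redundant, uplinks all go to {set(switches)}")
    else:
        print(f"{server}: OK, uplinks on {sorted(set(switches))}")
```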

So after 8 hours of hard work, with 4 degrees in the hot aisle, we had everything nicely put together and went back to the office. The next day we started sorting out what was what, handed the servers over for the platform install, and thought everything was fine – all mounted and installed.

After the platform was installed and we were entering the testing phase, we wanted to try the redundancy. And that's when we started finding out that the LACP bonds on the network cards, once we configured them, refused to come up. So we started investigating why, because everything had been installed correctly – the network infrastructure down to the cards, modules and ports. We tried to work out why it was happening, and only on some machines – something was crossed somewhere, and the locations seemed random.

We went from A to Z. First we took the Intel servers and looked at what was wrong where. We even started drawing it on the whiteboard – ports 1 0 1 0 – everything fine, left to right, makes sense. Then we noticed from the photos that the modules are upside down and that the network cards are upside down too, so 0 is 1 and 1 is 0. We had created a cross-over on the physical layer, which is why the LACP bonds refused to come up – the IDs didn't match. Unfortunately, we didn't know that at the time.
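
From the server side this kind of mis-cabling can be spotted without driving anywhere: with Linux bonding in 802.3ad mode, if the two members of a bond end up in different LACP aggregators, the links did not bundle. A minimal sketch, assuming the standard /proc/net/bonding interface:

```python
# Minimal sketch: check whether both slaves of a Linux 802.3ad bond ended up
# in the same LACP aggregator. Differing aggregator IDs mean the links did
# not bundle -- typically a sign of mismatched or crossed switch ports.
import re
import sys

def aggregator_ids(bond="bond0"):
    with open(f"/proc/net/bonding/{bond}") as f:
        text = f.read()
    ids = {}
    # Each "Slave Interface: <nic>" section carries its own "Aggregator ID".
    for slave, block in re.findall(
            r"Slave Interface: (\S+)\n(.*?)(?=\nSlave Interface:|\Z)",
            text, flags=re.S):
        match = re.search(r"Aggregator ID: (\d+)", block)
        if match:
            ids[slave] = int(match.group(1))
    return ids

if __name__ == "__main__":
    bond = sys.argv[1] if len(sys.argv) > 1 else "bond0"
    ids = aggregator_ids(bond)
    print(ids)
    if len(set(ids.values())) > 1:
        print(f"{bond}: slaves sit in different aggregators -> LACP did not bundle")
    else:
        print(f"{bond}: OK, all slaves in one aggregator")
```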

So a colleague was sent to Prague to re-plug it, so that we wouldn't have to go there unnecessarily. The designer has to solve everybody's problems, so I went along to push cables too.

After three working days you have a nicely drawn diagram of how everything is connected, arrows and all, and an exact procedure for what needs to be done once you're there. Of course we handed this procedure over – I had followed it myself – and he arrived at the data center and called us to say he could switch the first server and do exactly what we had told him.

I look at the switch, and it starts being re-plugged in two steps. He clicked it over and we began to find that two steps weren't enough, because on the physical side the cables were shuffled in a completely different way. So when my colleague and I had plugged them in, we shouldn't have gone straight down vertically but shuffled the right ones around. Working with optical breakouts was a really fun thing.

Better to prepare in advance and test the iron while it's still sitting in your office – at least verify that it will work the way you think it will. It won't work the way you think at first.

Every new technology carries certain risks, and we got burned pretty badly, both with the breakout cables and with the hardware as such – we simply didn't notice that the network cards were flipped. Our mistake.

And when we redid everything on the switches, we found out that the flipped network cards are only in the Intel boxes, not the AMD ones. So the trip there to re-plug it wasn't made twice but three times, because once it was all redrawn and re-plugged, only half of it worked again.

And by now somebody needs to test the platform, you want to run tests, and you can't just rewire everything, because it would all die and the whole platform might have to be redone at that point. So we had to be very careful to keep the test part alive as well.

So we were a bit like tinkerers – the cables now look like tangled braids instead of a nice tidy loom, but it works. By then we felt at home there; they were smiling at us at the reception desk.

We got a question in the chat – were there any problems with the migration?

We split the story of this project – the things that went wrong or made it difficult and complicated – into two parts: the hardware and project side, and then the deeper layer of configuration, communication and setup, because that was a whole other set of problems to deal with.

Today we covered mostly the physical part; next time we'll look at the software and configuration side, which is where the migration partly belongs. Of course we had no problems at all, everything was sunny, everything went exactly as planned, nothing happened, no data was lost, everything was great. And how we actually did it – we'll talk about that next time.

In your defense, so it doesn't sound like everything went down the drain, I have to say that the plan and the expectations you had were in many ways fulfilled beyond what you would have expected.

The migration ended up being much more hectic precisely because the schedule had slipped, so the planned migration window had to be shortened; there was pressure to get it done as soon as possible. We were pressed for time because of the relationship with the original provider, yet in the end the migration went smoothly: no customer complaints, everything came up everywhere, no data was really lost, and it was done in less than half the planned time. For me, as the designer, that was the icing on the cake; for the network guys it was the final nail, and then they had to take time off – we didn't see them for a few weeks. I wasn't even allowed to show my face, because in the morning they could barely manage to say hello. But it got done.

The details will come up again in the software and configuration part, because the network had to be adjusted heavily and, let's say, a certain amount of hara-kiri had to be performed there so the migrations could happen at all.

It's over, it's done, and the next project will be a breeze – and times two. It always ends up like that; you have to take it into account. But the important thing is that everything can be solved somehow, and if you have good people and the will, then in the end all this trouble doesn't have to cost that much extra money.
