Planet WLCG

May 07, 2012

UKI-SCOTGRID-DURHAM, Durham, UK

One XOS, Great Big Purple Packet Eater. Sure looks good to me.

So we haven't been blogging a great deal since December, and for good reason. We found ourselves in the exciting position of being given additional funding to enhance our network capability, and we also had additional equipment to install into the cluster.

First things first, however. As you may have read, we have had no end of issues with the older network equipment. We had a multi-vendor environment which, while adequate for 800 analysis jobs and 1200 production jobs, wasn't quite cutting the mustard, as we couldn't expand from there.

The main reason was the 20 Gig link between the two computing rooms which was having real capacity issues. Also, add in issues between the Dell and Nortel LAG and associated back flow problems, sprinkled with a buffer memory issue on the 5510s and you get the picture. In addition to this we were running out of 10 Gig ports and therefore couldn't get much bigger without some investment.

Therefore, the grant award was a welcome attempt to fix this issue. After going to tender we decided upon equipment from Extreme Networks. The proposed solution allowed for a vast 160 Gigabit interconnect between the rooms, broken into two resilient link bundles in the Core, and an 80 Gigabit Edge layer. In addition to this connection we also installed a 32 core OM4 grade fiber optic network for the cluster, which will carry us into the realms of 100 Gigabit connections when they become available and cheap enough to deploy sensibly.

We now have 40 x 40 Gigabit ports, 208 x 10 Gigabit ports and 576 x 1 Gigabit ports available for the Cluster.

 There is quick and clever and here it is

The new deployment utilises X670s in the Core and X460s at the Edge.

The magic of the new Extreme network is that it uses EAPS, so bye bye Spanning Tree and good riddance. It also uses MLAG, which allows us to load share traffic across the two rooms, so having 10 Gigabit connections for disk servers in one room is no longer an issue.

Then it got a bit better. Due to the Extreme OS we can now write scripts to handle events within the network which ties in with the longer term plan for a Cluster Expert System (ARCTURUS) which we are currently designing for test deployment. More on this after August.

Finally, it even comes with its own event monitoring software, Ridgeline, which gives a GUI for the whole deployment.

We stripped out the old network, installed the new one and, after some initial problems with the configuration, which were fixed in a most awesome fashion by Extreme, got the new one up and running. What we can say is that the network isn't a problem anymore, at all.

This has allowed us to start to concentrate upon other issues within the Cluster and look at the finalised deployment of the IPv6 test cluster, which has benefited in terms of hardware from the new network install. Again, more on this soon.

Right, so now to the rest of the upgrade. We have also extended our cold aisle enclosure to 12 racks, have a secondary 10 Gig link onto the Campus being installed and have a UPS. In addition to this we refreshed our storage using Dell R510s and M1200s as well as buying 5 Interlagos boxes to augment the worker node deployment.

 The TARDIS just keeps growing

We also invested in an experimental user access system with wi-fi and will be trying this out in the test cluster to see if a wi-fi mesh environment can support a limited number of grid jobs.  As you do.

In addition to this we improved connectivity for the research community in PPE at Glasgow and across the Campus as a whole, with part of the award being used to deliver the resilient second link and associated switching fabrics.

It hasn't been the most straightforward process, as the decommissioning and deployment work was complex and very time consuming, in an attempt to keep the cluster up and running as long as possible and to minimise down times.

We didn't quite manage this as well as expected due to the configuration issues on the new network but we have now upgraded the entire network and have removed multiple older servers from the cluster to allow us to enhance the entire batch system for the next 24 - 48 months.

As we continue to implement additional upgrades to the cluster we will keep you informed.
For now it is back to the computer rooms.

by Mark Mitchell (noreply@blogger.com) at May 07, 2012 05:40 PM

GridPP At The Top Of Europe

This news article appeared on the GridPP website and is worth reposting to our blog as it gives an overview of the collaboration's efforts to date within the WLCG and with the Non High Energy Physics (HEP) communities.

by Mark Mitchell (noreply@blogger.com) at May 07, 2012 05:39 PM

Preparing for IPv6

Generally, we don't repost news items on the blog, but this BBC article gives a good indication of the changes underway globally for implementing IPv6. Currently the Glasgow Scotgrid test cluster is being revamped following our last spending cycle and we are embarking on a full test programme of IPv6, specifically around running Grid services. As this work progresses we will regularly update the blog.

by Mark Mitchell (noreply@blogger.com) at May 07, 2012 05:38 PM

Stockholm LHCONE Meeting

Kristall in the Sergels Torg Stockholm


We were in attendance at the LHCONE meeting at KTH in Stockholm last week. The purpose of this collaboration is to investigate the efficient use of networks globally for LHC research. As usual it was an excellent meeting where the technical mechanisms for current and future network deployments were discussed and considered.

The agenda can be found here. Some of the highlights of the meeting included an excellent presentation by Erwin Laure on the Swedish and Scandinavian Super Computing and Grid computing infrastructure, Joe Mambretti's presentation on the GLORIAD global research network, Mike O'Connor's discussion on the technical configurations required to avoid asymmetric routing issues between the LHCONE and the current production networks, and Domenico Vicinanza's presentation on perfSONAR MDM.

In addition to these presentations, technical discussions were held on various technologies for bandwidth reservation, ultra high speed networking and OpenFlow. As these discussions develop through the network architecture groups we will keep you up to date.


Also, the weather in Stockholm was exceptional and the KTH Campus is worth a visit for its architecture alone. I would like to thank our hosts and all the other attendees for making this such an enjoyable and informative couple of days.




 KTH Campus Stockholm
 

by Mark Mitchell (noreply@blogger.com) at May 07, 2012 05:38 PM

April 05, 2012

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

The Big Upgrade in pictures

New Cisco blades, engines and power supplies

Dell boxes, among them the new switches
Aerial view of the old cabling
Frontal view of the mess
Cables unplugged from the cisco
Cisco old blades with services racks still connected
New cat6a cisco cabling aerial view nice and tidy
Frontal view of the new cisco blades and cabling nice and tidy
Old and new rack switches front view
Old and new rack switches rear view
Emptying and reorganising the racks
Empty racks ready to be filled with new machines
Old DELLs cemetery
Old cables cemetery. All the cat5e cables going under the floor from the racks to the cisco, half of the cables from the rack switches to the machines, and all the patch cables in front of the cisco shown above have gone.
All but two of the racks now have the new switches, but the machines are still connected with cat5e cables. Upgrading the network cards will be done in Phase two, one rack at a time, to minimize service disruption.

The downtime lasted 6 days. Everybody who was involved did a great job, and the choice of 10GBASE-T was a good one because port auto-negotiation allows us to run at 3 different speeds on the same switches: PDUs at 100Mbps, old WNs and storage at 1Gbps, and the connection to the cisco at 10Gbps. We also kept one of the old cisco blades for connections that don't require 10Gbps, such as the out-of-band management cables; two racks of servers that will be upgraded at a later stage are also still connected at 1Gbps to the cisco. And we finished perfectly in time for the start of data taking (and Easter). :)

by Alessandra Forti (noreply@blogger.com) at April 05, 2012 10:46 PM

March 31, 2012

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

So long and thanks for all the fish


In 2010 we had already decommissioned half of the original mythical 2000 (1800 for us) EM64T CPUs Dell cluster that allowed us to be the 4th of the top 10 countries in EGEE in 2007.

This year we are decommissioning the last 430 machines that served us so well for 6 years and 2 months. So... so long and thanks for all the fish.

by Alessandra Forti (noreply@blogger.com) at March 31, 2012 03:07 PM

DPM database file systems synchronization

The synchronisation of the DPM database with the data servers' file systems has been a long standing issue. Last week we had a crash that made it more imperative to check all the files, and I eventually wrote a bash script that makes use of the GridPP DPM admin tools. I don't think this should be the final version, but I'm quicker with bash than with python and therefore I started with that. Hopefully later in the year I'll have more time to write a cleaner version in python, based on this one, that can be inserted in the admin tools. It does the following:

1) Create a list of files that are in the DB but not on disk
2) Create a list of files that are on disk but not in the DB
3) Create a list of SURLs from the list of files in the DB but not on disk to declare lost (this is mostly for atlas but could be used by LFC administrators for other VOs)
4) If not in dry run mode proceed to delete the orphan files and the orphan entries in the DB.
5) Print stats of how many files were in either list.
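
Steps 1, 2 and 5 amount to a set difference between two file lists. A minimal sketch of that idea, assuming both lists already exist as plain text files with one path per line; the function name and file names are hypothetical and this is not the code in the actual script:

```shell
#!/bin/sh
# compare_lists DB_LIST DISK_LIST OUT_DIR
# Writes OUT_DIR/in_db_not_on_disk and OUT_DIR/on_disk_not_in_db,
# then prints how many entries ended up in each list.
compare_lists() {
    db_list=$1
    disk_list=$2
    out_dir=$3
    # comm needs sorted input; sort -u also removes duplicate entries.
    LC_ALL=C sort -u "$db_list"   > "$out_dir/db.sorted"
    LC_ALL=C sort -u "$disk_list" > "$out_dir/disk.sorted"
    # -23 keeps lines only in the first file, -13 lines only in the second.
    LC_ALL=C comm -23 "$out_dir/db.sorted" "$out_dir/disk.sorted" > "$out_dir/in_db_not_on_disk"
    LC_ALL=C comm -13 "$out_dir/db.sorted" "$out_dir/disk.sorted" > "$out_dir/on_disk_not_in_db"
    echo "in DB but not on disk: $(wc -l < "$out_dir/in_db_not_on_disk" | tr -d ' ')"
    echo "on disk but not in DB: $(wc -l < "$out_dir/on_disk_not_in_db" | tr -d ' ')"
}
```

Sorting both lists once and letting `comm` do the comparison keeps the work on the data server itself and needs no further database access.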

Although I put in a few protections, this script should be run with care and, unless in dry run mode, shouldn't be run automatically AT ALL. However, in dry run mode it will tell you how many files are lost, which is a good metric to monitor regularly as well as when there is a big crash.

If you want to run it, it has to run on the data servers where there is access to the file system. As it is now it requires a modified version of /opt/lcg/etc/DPMINFO that points to the head node rather than localhost, because one of the admin tools used does a direct mysql query. For the same reason it also requires the dpminfo user to have mysql select privileges from the data servers. This is the part that really could benefit from a rewrite in python and perhaps proper API use, as the other tool does. I also had to heavily parse the output of the tools, which weren't created exactly for this purpose, and this could also be avoided in a python script. There are no options, but all the variables that could be options to customize the script with your local settings (head node, fs mount point, dry_run) are easily found at the top.
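
For readers unfamiliar with the pattern, a dry run switch of this kind is often implemented as a small wrapper around every destructive command. The sketch below is a generic illustration, not taken from the script itself; the `DRY_RUN` variable and the `run` helper are assumed names:

```shell
#!/bin/sh
# Setting a site would edit at the top of the script; defaults to the safe mode.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY RUN: $*"
    else
        "$@"
    fi
}
```

A deletion would then be written as `run rm -f -- "$orphan"`, so nothing is removed unless `DRY_RUN` has been switched off explicitly.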

Creating the lists takes very little time, no more than 3 minutes on my system, but it depends mostly on how busy your head node is.

If you want to do a cleanup instead, the time is proportional to how many files have been lost and can be several hours, since it does one DB operation per file. The time to delete the orphan files also depends on how many there are and how big they are, but it should take less time than the DB cleanup.

The script is here: http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-synchronise-disk-db.sh

by Alessandra Forti (noreply@blogger.com) at March 31, 2012 03:06 PM

February 29, 2012

RAL-LCG2, Oxford, UK

Summary of Changes Being Made at the Start of 2012.

While the LHC has been in its winter shutdown we have been preparing for 2012 data taking. We are almost at the end of the changes planned before the LHC startup, although longer term plans mean there is further work to do.

For Castor we have just completed an upgrade to version 2.1.11-8. As part of this upgrade we have also brought new hardware into use for the Castor head nodes. The Castor SRMs were upgraded to version 2.11 a few weeks ago, again running on new hardware.

The architecture of the underlying Oracle database infrastructure, along with some of its hardware, is being updated. The first step towards this was carried out in the first week we returned to work in the New Year when the Castor databases were moved to a temporary configuration. Following testing we are ready for the next step in this migration, being scheduled for next week, which will result in us having the planned new architecture in place for the Castor databases using Oracle Data Guard to ensure two copies of the databases are continually updated.

For the Grid Services there has been a steady program of moving from glite to UMD versions of software, with the opportunity being taken to update underlying operating systems. The batch server, BDIIs, APEL, LB and some of the WMS services have already been upgraded. Others, including MyProxy, CEs and FTS, remain to be done.

The Tier1 Network infrastructure has required some work, both to investigate a known problem and to prepare for other changes in the machine room.

On top of this the LHC VOs have had other work going on. This included updating central (CERN) systems to Oracle 11, which led to a requirement for the Atlas & LHCb 3D databases at RAL to be upgraded to Oracle 11 too. For Atlas the LFC has been migrated from RAL to CERN. We have tried to co-locate other changes with these events, although our own timescales and constraints have meant this has not always been possible.

Furthermore our recent purchases of disk and CPU capacity have arrived and are undergoing testing ahead of being placed into service. One batch of new CPU capacity is expected to be deployed in the next few days, the remaining batch within a few weeks.  The new disk servers will be ready within a few weeks and we will deploy additional disk capacity to meet our pledges by the start of April.

In summary, the most significant remaining work to be done before the start of the LHC for the 2012 data taking is the next step of the Castor database architecture upgrade, which will require a short (couple of hours) downtime to Castor. Some more of the Grid Services will also be updated, including the FTS and MyProxy services, before LHC startup.

Despite all the changes made during this LHC shutdown there are further upgrades and changes to be carried out in the longer term, and of particular note this includes migrating the Castor databases to Oracle 11, which will in turn need a further update to Castor. In addition, recently purchased networking equipment will be brought into use.

As usual we use the weekly RAL Tier1 – Experiments Liaison meeting as a forum to announce and discuss these changes, with appropriate outages then declared in the usual way via the GOC DB.

by Gareth Smith at February 29, 2012 05:07 PM

February 27, 2012

UKI-SCOTGRID-DURHAM, Durham, UK

LSC files and emailAddress redux

This post involves a very complicated journey to get to a simple place.

The fundamental problem is around the catchy titled OID 1.2.840.113549.1.9.1

No, wait, let me take a step back. On the Grid, we use certificates for authentication. An X509 certificate is, as with most certificates, a signed set of assertions, and a public key. As with the rest of the X500 standards, its native language is something called ASN.1 (Abstract Syntax Notation 1) (aka X208, and the later revision X680), held in files encoded by the DER (Distinguished Encoding Rules).

The fundamental takeaway from that tech-dump is that X509 certificates are not in plain text, and there are multiple standards required in order to understand their contents.

So when someone says their certificate Distinguished Name is '/O=SomeUni/OU=SomeDept/L=group/CN=JohnSmith' ... that's not quite accurate. What they really mean is that their certificate DN is some set of objects that can be unambiguously matched to that ASCII text.

That happens because there are universally agreed mappings between the actually stored OID and the text representation of them (e.g. CN is OID 2.5.4.3).

Unfortunately, the agreement breaks down a bit for the emailAddress field; with some software mapping it to Email, and others to emailAddress. By the PKCS#9 standard, one could argue that it should be emailAddress - but that doesn't help us get software working.
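
The mapping can be pictured as a lookup table from OID to text label. The sketch below is illustrative only: the function name and the `short`/`long` switch are made up, and real software takes these tables from its crypto library rather than from a shell `case`. The OIDs listed are the standard ones for the attributes that appear in the DNs in this post.

```shell
#!/bin/sh
# oid_label OID [STYLE]: print the text label for a few DN attribute OIDs.
# STYLE selects the spelling used for the e-mail attribute:
#   "long" (default) -> emailAddress, "short" -> Email
oid_label() {
    case $1 in
        2.5.4.3)  echo CN ;;
        2.5.4.6)  echo C ;;
        2.5.4.7)  echo L ;;
        2.5.4.10) echo O ;;
        2.5.4.11) echo OU ;;
        1.2.840.113549.1.9.1)
            if [ "${2:-long}" = "short" ]; then
                echo Email
            else
                echo emailAddress
            fi ;;
        *) echo "unknown OID: $1" >&2; return 1 ;;
    esac
}
```

Two tools that agree on every entry except the last one will print different ASCII DNs for the same certificate, which is exactly the mismatch at issue here.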

Fortunately, all of this is not a problem unless we want to store certificate DN's in ASCII, _and_ want to have email addresses in the DN.

Yeah, you can see where this is going, can't you?

In the UK, Jens has been working to allow us to not have them in DN's. However, in the short term, they are present.

One particular case where ASCII representations of the DN are used is in LSC files - which are used to authenticate VOMS servers. What happens is if the VOMS server DN matches the DN in the LSC file, and the cert was signed by the CA DN in the LSC file, _and_ the certificate chain is signed by a trusted root, then it's valid. This process means that we don't need to distribute lots of VOMS server certs, just the root CA's, and a small note (that shouldn't change over renewals) of the server DN.

I've been tidying up our ARC install here, and during the process managed to break things. Not unusual for me (one of the reasons I avoid tidying at all costs!), but this one was quirky. I'd put the vomsdir under CFEngine control, so that it was sync'd with all the other servers, and suddenly it stopped accepting the scotgrid VO.

Root cause, as if you can't guess by now, LSC file, and the emailAddress. Looks like the gLite stack expects it one way, and ARC the other. Of course, by the time you read this, that's probably been fixed somewhere, but not in the version we had installed.

It turns out that there's one trick in LSC files that saves this case. Let me put the LSC file in here:

/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA
------ NEXT CHAIN ------
/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/emailAddress=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA



The 'NEXT CHAIN' line lets one put multiple entries in the file. However, it appears that ARC isn't reading multiple entries, only the first one. So, in this case, I put the ARC friendly one first, so it matches fine - and the gLite stack tries again, finds the second, and thus succeeds.
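
The two behaviours can be modelled in a few lines of shell and awk. This is only a sketch of the matching logic as described in this post, not code from ARC or gLite; the function name and the `first` switch are invented for the illustration, and it handles only two-line chains like the ones shown above.

```shell
#!/bin/sh
# lsc_matches FILE SERVER_DN CA_DN [first]
# Succeeds if a chain in FILE is exactly SERVER_DN followed by CA_DN.
# With "first" as the fourth argument only the first chain is considered.
lsc_matches() {
    SERVER_DN="$2" CA_DN="$3" MODE="${4:-all}" awk '
        /^-+ NEXT CHAIN -+$/ {            # separator: a new chain starts
            if (ENVIRON["MODE"] == "first") exit
            n = 0
            next
        }
        /^[ \t]*$/ { next }               # skip blank lines
        {
            n++
            if (n == 1) ok = ($0 == ENVIRON["SERVER_DN"])
            else if (n == 2 && ok && $0 == ENVIRON["CA_DN"]) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

Run against the file above, the default mode accepts either spelling of the server DN, while the `first` mode accepts only the `Email=` spelling, which mirrors why the ARC friendly entry has to come first.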

Important notes: I can't find anyone else with a field report of NEXT CHAIN working in the gLite stack. This is such a field report. It doesn't appear to work with ARC.

by Stuart Purdie (noreply@blogger.com) at February 27, 2012 04:20 PM

December 21, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Batch system juggling

We've been a bit quiet up here recently. This is normally a sign of either nothing interesting happening, or entirely too many interesting things happening. Opinions on that may divide, but I think it's closer to the latter...

One of the recent bits of fun that occurred was with our batch server. This story actually starts a long time ago; about this time last year. At that point, we started to get intermittent memory errors from the Torque server - corrected by ECC - but that's generally a sign that the RAM's about to fail. Given that the batch server is a single point of failure for a site, that's not a good thing.

So I spent some time preparing a spare box, and being ready to move the batch system over, in case it failed over the winter break. Which, after all that prep, it didn't, and the errors stopped. On the expectation that the current hardware was nearing end of life, we ordered a new box early this year, and have had it sitting in a machine room for a while.

Unfortunately we didn't get time to have it running a tested batch system until our power supply started to ... well, insert colourful metaphor here, describing the 8 months where we were affected by lack of power.

Power got to stable supply in September, and so to catch up on things. One of the things we got around to was software versions. Whilst we didn't intend to update the Torque version, and managed to avoid it for a bit, the gLite developers eventually managed to sneak the update past us as part of an ordinary gLite update. Strictly, this didn't affect the batch server, just all the CE's, making them incompatible with the previous version of Torque.

Whilst a clever manoeuvre, reminiscent of Odysseus' Pony, it did leave us with a conundrum of either reverting the gLite update, or running forward with it. Neither were options of good character, but running forward did have some actual documentation; hence it was full speed ahead.

Which worked out well enough. The Torque 2.5.7 packages were set to use Munge, so getting that installed and tested as a first step helped it go smoothly. To preserve compatibility in file locations, we used /etc/sysconfig/pbs_mom to put the pbs working directories in the same place as previously - meaning we didn't have to reconfigure any other tools.

What didn't go so smoothly was the memory leak in the server.

Which gave it a runtime of around 36 hours between crashes. Actually, not even crashes - we found that the pbs_server process hit either


12/05/2011 10:19:12;0080;PBS_Server;Req;req_reject;Reject reply code=15012(PBS_Server System error: No child processes MSG=could not unmunge credentials), aux=0, type=AlternateUserAuthentication, from tomcat@svr021.gla.scotgrid.ac.uk

or

10/29/2011 18:11:24;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed


and then sat around moaning. Had it crashed hard, then the auto-restart would have caught it. Ho, hum, one for the Fast Fail philosophy there.
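
Since the server stays up but stops doing useful work, an external check has to look for the symptoms rather than for a dead process. A minimal sketch of such a check, assuming the two messages quoted above are reliable markers; the function name is made up, and this is not something described as deployed here:

```shell
#!/bin/sh
# server_looks_wedged LOGFILE: succeed if the log contains either of the
# two error messages quoted above. -F treats the patterns as fixed strings.
server_looks_wedged() {
    grep -q -F \
        -e 'could not unmunge credentials' \
        -e 'Cannot allocate memory (12) in send_job, fork failed' \
        "$1"
}
```

A cron job could restart pbs_server when this succeeds. A practical version would need to look only at recent log entries, otherwise one old error line would trigger a restart on every run.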


By this point, my proof reader is pointing out that I started off talking hardware, and am now talking software. Punchline is that the new server that we never got a chance to use has a lot more RAM than the old server. Therefore we wanted to move the server from the old hardware to the new, to give it a lot more RAM space. That won't fix the memory leak, but will mitigate the problem a bit.

Conventionally, this would involve draining the cluster, repositioning the CE's and then starting up everything again. Had we done that, this blog post would be over now.

Instead, we did a rolling update. This let us move things over without having to do a full drain. The biggest problem with a full drain is that, while most of the jobs finish within a shorter period of time than the limit, there are always some that take the full duration. This leaves us with an empty cluster, doing nothing, for 24 hours or so, waiting on a couple of jobs to finish.

So, instead, by moving things in small batches, then we can keep most of the nodes working, and thus get more work out of things. Step zero is to disable cfengine, otherwise it tends to try and 'fix' things part way through.

Step one is to drain a CE, which we did over a weekend, and a small number of nodes, which we put offline on the Sunday morning.

Come Monday, I set up and tested basic operations with the new batch server, and then moved the freed up nodes across to it. Once those were tested (which shook out a couple of issues about versioning of some libs), point the CE at the new batch server, and then run a test job through it. (It turns out that Atlas are fast enough to sneak some pilots through a 2 minute window for a test job. However, only a few, so they actually functioned as effective tests, without compromising the site if they failed).

After that, it's time to offline another CE, and then some more nodes, and start moving nodes over when they were empty. In the end I scripted this:


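#!/bin/sh
# Move one empty worker node from the old batch server to the new one.

NODE=$1
FROM=svr666
TO=svr999

# Count running jobs that list this node (-w stops node1 matching node10)
RUNNING=$(qstat -n -1 | grep -w -- "${NODE}" | wc -l)

if [ "${RUNNING}" -ne 0 ]
then
    echo "${NODE}: Still ${RUNNING} jobs going, skipping"
    exit 2
fi

# Read the core count (np) from the node definition on the current server
CORES=$(qmgr -c "print node ${NODE}" | grep "np = " | cut -d= -f2)

echo "${NODE}: Moving to ${TO} with ${CORES} cores"

# Register the node on the new batch server
ssh "${TO}" "~/addNode.sh ${NODE} ${CORES}"

# Switch the mom config while pbs_mom is stopped
ssh "${NODE}" "service pbs_mom stop"
scp config.mom.svr666 "${NODE}:/var/spool/pbs/mom_priv/config"
ssh "${NODE}" "service pbs_mom start"

# Remove the node from the old batch server
ssh "${FROM}" "~/deleteNode.sh ${NODE}"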


In theory one can run qmgr remotely, rather than ssh-ing to the batch servers and running a script. In practice, with the different versions of Torque, I couldn't get that to work. Note the automation of the mom config switch as well; and that this script checks that the node is empty.

This reduced the gradual move of nodes to a process of croning the script, and offlining nodes occasionally.
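
One way to arrange the cron side is a small wrapper that walks a list of pending nodes and hands each to the move script, keeping the ones that were still busy for the next run. This is a sketch under assumptions: the function name and the node list file are hypothetical, and the mover is passed in as a command name so that the script above can be plugged in unchanged.

```shell
#!/bin/sh
# move_empty_nodes NODE_LIST MOVER
# Calls MOVER once per node named in NODE_LIST (one per line). Nodes that
# are moved successfully are dropped from the list, so each run only
# retries the nodes that were still busy last time.
move_empty_nodes() {
    node_list=$1
    mover=$2
    remaining=$(mktemp) || return 1
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        # </dev/null stops the mover (e.g. ssh inside it) eating the list.
        if ! "$mover" "$node" </dev/null; then
            echo "$node" >> "$remaining"
        fi
    done < "$node_list"
    mv "$remaining" "$node_list"
}
```

Because the move script exits non-zero when a node still has jobs, a busy node simply stays in the list until a later cron run finds it empty.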

The net result was that we were operating at around 80% capacity for 48 hours, and it was all rather uneventful - in a good way. The final step was to update cfengine config and re-enable it.

One of the plus points of the above script is that it should be simple to adapt to two distinct batch systems; which means if we end up moving away from Torque, we should be able to do that without downtime too.

by Stuart Purdie (noreply@blogger.com) at December 21, 2011 10:11 PM

December 20, 2011

RAL-LCG2, Oxford, UK

RAL Tier1 – Plans for Christmas & New Year Holiday

RAL closes at 3pm on Friday 23rd December and will re-open on Tuesday 3rd January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems. Some hardware interventions, such as to swap out faulty disks will also take place over this time.

Furthermore we do not have support on 25/26 December & 1st January for some services we rely on. The impact of any failures around these particular dates may therefore be more extended. Also, over the holiday we have relaxed our expectation that the on-call person will respond within two hours, particularly on the specific dates just mentioned.

During the holiday we will check for tickets in the usual manner. However, only service critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:
http://www.gridpp.rl.ac.uk/status/

Gareth Smith

by Gareth Smith at December 20, 2011 11:21 AM

December 01, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

DPM upgrade 1.7.4 -> 1.8.2 (glite 3.2)

Last week I upgraded our DPM installation. It was a major change because I upgraded not only the DPM version but also the hardware and the backend mysql version.

I didn't take any measurements before and after this time. I knew that becoming an alpha site in atlas was taking its toll on the old hardware and many of the timeouts were from gridftp, but there had been a reappearance of the mysql ones I talked about in previous posts, to the point that even restarting the service was hard.

[ ~]# service mysqld restart
Timeout error occurred trying to stop MySQL Daemon.

Stopping MySQL: [FAILED]

Timeout error occurred trying to start MySQL Daemon.


So I decided that the situation had become unsustainable and it was time to move to better hardware and software versions.

* Hardware: 2 cpu, 4GB mem, 2x250 GB raid1 -> 4 cores (HT on = 8 job slots), 24GB mem, 2x2TB raid1

There is no 'why' here: it was OK when we had limited access, but the recent load was really too much for the old machine even with all the tuning. Bad blocks on the disks are a possibility, but no red LEDs or hardware errors were reported by the machine.

* Mysql: 5.0.77 -> 5.5.10

Why mysql 5.5? Because InnoDB is the default engine and they have improved performance and instrumentation. On top of other things that we might actually start to use. A good blog article about the 5 reasons to move is this one: 5 good reasons to upgrade to mysql 5.5.

MySQL 5.5 is not in EPEL yet, but I found this CentOS community site that has the rpms and the instructions to install them.

After the installation I also optimized the database, partly with what I had already done in July and partly by running a handy script, mysqltuner.pl. This last one helps with variables you might not even know about, and even if you know them it tells you if they are too small. You need to be patient and let a few hours pass before running it again.

* DPM: 1.7.4 -> 1.8.2

Why DPM 1.8.2 from glite 3.2? I would have gone for the UMD release or even the EMI one, but glite 3.2 was moved to production earlier than those, and since I had been waiting for this release since at least April I didn't think twice about it when I saw the escape route. It was really good timing too, as it happened when I really couldn't postpone an upgrade anymore. You can find more info in the release notes. Among other reasons to upgrade: srmv2.2 in 1.7.4 has a memory leak which wasn't noticeable while the load was contained, but for us it exploded in October and is the reason I had to restart it every two days over the past few weeks.

Below are the steps I took to reinstall the head node.

On the old head node

* Set the site in downtime, drain the queues and kill all the remaining jobs.

* Turn off all the dpm and bdii services on the old head node

* Make a dump of the current database for backup

mysqldump -C -Q -u root -p -B dpm_db cns_db > dpm.sql-20111125.gz

* Download dpm-drop-requests-tables.sql supplied by Jean Philippe last July

wget http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-drop-requests-tables.sql

* Drop the requests tables. This step is really useful to avoid painful reload times, as I said in this other post about DPM optimization, and because it drastically reduces the size of ibdata1 when you reload, which also has benefits (my ibdata1 was reduced from 26GB to 1.7GB). Still, you need to plan because it might take a few hours depending on the system. On my old hardware it took around 7 hours.

mysql -p < dpm-drop-requests-tables.sql

* Dump reduced version of the database

mysqldump -C -Q -u root -p -B dpm_db cns_db > dpm.sql-20111125-v2.gz


* Copy both to a web server from which they can be downloaded at a later stage.

* Update the local repository for DPM head node and DPM disk servers. Since it is still glite I just had to rsync the latest mirror to the static area.
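
A note on the dump files above: the `-C` (`--compress`) option of mysqldump compresses the traffic between client and server, not the output, so a file written this way is plain SQL despite the `.gz` suffix (which is also why it can be fed straight to `mysql` later on). If dumps from different sources get mixed up, a small check avoids guessing. The helper below is a hypothetical sketch, not part of the procedure described here:

```shell
#!/bin/sh
# dump_reader FILE: print the command that turns FILE into a plain SQL
# stream, depending on whether FILE really is gzip-compressed.
dump_reader() {
    # gzip -t only tests integrity; reading from stdin sidesteps any
    # checks on the file name suffix.
    if gzip -t < "$1" 2>/dev/null; then
        echo "gunzip -c"
    else
        echo "cat"
    fi
}
```

It could be used as `$(dump_reader dpm.sql-20111125-v2.gz) dpm.sql-20111125-v2.gz | mysql -u root -p`. To get a genuinely compressed dump, the mysqldump output would have to be piped through `gzip` before being written.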

On the new head node

* Install the new machines with a DPM head node profile. This was again easy since it is still glite no changes were required in cfengine.

* Most of the following is not standard and I put it in a script. If you have problems with users IDs created by avahi packages you can uninstall them with yum removing all the dependencies and let them be reinstalled by the bdii dependency chain. It should work also uninstalling them with rpm -e --nodeps. This leaves redhat-lsb (which is what the bdii depends on) untouched but I haven't tried this last method. Here are the commands I executed:

# Get the dpm DB file
rm -rf dpm.sql-20111125-v2.gz*
wget http://ks.tier2.hep.manchester.ac.uk/T2/tmp/dpm.sql-20111125-v2.gz


# Install mysql5.5
rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum -y remove libmysqlclient5 mysql mysql-*
yum -y clean all

yum -y install mysql55 mysql55-server libmysqlclient5 --enablerepo=webtatic

service mysqld stop

rm -rf /var/lib/mysql/*

# Get the local my.cnf
cfagent -vq

service mysqld start


# Install the DPM rpms
yum -y remove cups avahi avahi-compat-libdns_sd avahi-glib
yum -y install glite-SE_dpm_mysql lcg-CA


# Modify sql scripts for mysql5.5

cd /opt/lcg/share/DPM/
for a in create_dp*.sql; do sed -i.old 's/TYPE/ENGINE/g' $a;done
grep ENGINE *


# Run YAIM and upload old DB

cd

/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql


mysql -u root -p -C < /root/dpm.sql-20111125-v2.gz


# NECESSARY FOR THE FINAL UPDATES

/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql
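
The TYPE to ENGINE substitution in the script above is needed because MySQL 5.5 no longer accepts the old TYPE table option, but it replaces every occurrence of TYPE, which would also rewrite any column or table name containing that string. Whether the DPM scripts contain such names is not checked here, so treat the following as a more conservative sketch rather than a required change: it only touches TYPE when it stands alone and is followed by '='. The sample table is made up for the illustration.

```shell
# Hypothetical sample file standing in for one of the create_dp*.sql scripts.
workdir=$(mktemp -d)
cat > "$workdir/create_sample.sql" <<'EOF'
CREATE TABLE dpm_sample (
  FILETYPE CHAR(1),
  r_token VARCHAR(36)
) TYPE=InnoDB;
EOF

# Replace TYPE only when it is a standalone word followed by '=';
# the original file is kept with a .old suffix, as in the script above.
sed -E -i.old 's/(^|[^A-Za-z_])TYPE[[:space:]]*=/\1ENGINE=/g' "$workdir/create_sample.sql"

converted=$(cat "$workdir/create_sample.sql")
echo "$converted"
```

The FILETYPE column keeps its name, whereas a plain s/TYPE/ENGINE/g would have turned it into FILEENGINE.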


* You will need to install the dpm-contrib-admintool rpm separately because it is not in the glite repository; it might be in the EMI one. Last time I heard, it had made it to ETICS. If you can't find it, there's still the sysadmin repo version and related notes on the GridPP wiki (Sam or Wahid, you are welcome to leave an update on this one).

* To upgrade the disk servers I just updated the repository, upgraded the rpms and reran YAIM.

by Alessandra Forti (noreply@blogger.com) at December 01, 2011 11:21 AM

October 06, 2011

RAL-LCG2, Oxford, UK

T10000C Tapes Brought into Production for Atlas.

In order to have sufficient tape capacity to meet the requirements of the VOs the Tier1 has to keep up-to-date with recent tape technologies. To this end Atlas data was switched over to start writing to the newer T10000C tapes, with a capacity of 5TB per tape, on Thursday 29th September.

A couple of years ago all the Tier1 data was held on T10000A tapes, each with a capacity of 500GBytes. During the second half of 2010 CMS data was moved to the T10000B tapes with 1TB capacity. (Both the ‘A’ and ‘B’ tapes share the same media, but the higher capacities require ‘B’ tape drives). The plan is to move Atlas, and then other VOs that have been still using the ‘A’ tapes, onto the ‘C’ tapes. This is part of a longer term strategy that sees a leapfrogging of tape technology.

The move to the T10000C tapes had become an operational necessity. Some initial teething problems with the T10000C tapes, now resolved with the vendor and proven during testing, had delayed the introduction of the new tapes & drives. Recently we have seen a very high rate of tape writing, with up to around 40 of the ‘A’ tapes written on some days in the last couple of weeks. Our supply of spare A/B media was rapidly reducing and we were faced with the possibility of needing to purchase more of the older media in order to keep up. This would have meant spending a significant amount of money without long term benefit.

There are some issues that remain to be understood with the new tapes. In particular we do not yet see the same performance in operation as we did during our testing. This is affected by writing smaller files and some network tuning remains to be done. However, with the T10000C tapes now in use with new Atlas data being written to them we can also begin the task of migrating Atlas data from the ‘A’ to the ‘C’ tapes – this is a significant operation with around 1.5PetaBytes to be moved.
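
To give a feel for the scale of the migration, here is a back-of-the-envelope calculation using only the figures quoted above. It assumes decimal units and tapes filled to their nominal capacity, neither of which is stated in the post, so the results are rough estimates only.

```shell
data_tb=1500      # about 1.5 PB of Atlas data, expressed in TB
a_tape_gb=500     # T10000A capacity
c_tape_tb=5       # T10000C capacity

a_tapes=$(( data_tb * 1000 / a_tape_gb ))   # 'A' tapes currently holding the data
c_tapes=$(( data_tb / c_tape_tb ))          # 'C' tapes needed after migration
daily_tb=$(( 40 * a_tape_gb / 1000 ))       # peak of about 40 'A' tapes written per day

echo "A tapes holding the data: $a_tapes"
echo "C tapes after migration:  $c_tapes"
echo "Peak write rate:          $daily_tb TB/day"
```

Under these assumptions roughly 3000 'A' tapes collapse onto about 300 'C' tapes, and the recent peak corresponds to about 20 TB written per day.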

by Gareth Smith at October 06, 2011 09:45 AM

September 28, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

September 23, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Leaving Lyon

The EGI Tech Forum is winding down, with only a few talks remaining. It's been a great meeting, with a wide range of talks on all areas of Grid Computing. Lots to think about and new ideas to try out!

by David Crooks (noreply@blogger.com) at September 23, 2011 03:21 PM

September 21, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Scotgrid goes South

Last week we attended the biannual GridPP Collaboration meeting.
The venue this time was CERN itself and the meeting was, as ever, incredibly useful.

We were lucky enough to have presentations from the Experiments, the LHC, EGI and the WLCG community as well as presentations from across the UK collaboration.

A full programme of the meeting is available here:

http://www.gridpp.ac.uk/gridpp27/



Above is a picture of our own Dr Crooks presenting on the Glasgow Security Model

by Mark Mitchell (noreply@blogger.com) at September 21, 2011 02:02 PM

September 19, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

EGI Tech Forum 2011

Bonjour Lyon!

After last week's GridPP 27 meeting in CERN, this week we are in Lyon for the 2011 EGI Tech Forum, running from Monday until Friday this week. You can follow the Forum online using some of the links here.

More later - time now to find some coffee before the first session...

by David Crooks (noreply@blogger.com) at September 19, 2011 08:46 AM

September 09, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

cvmfs upgrade to 2.0.3

Last week I upgraded the cvmfs on all the WN to cvmfs-2.0.3. The upgrade for us required two steps.

1) change of repository: since Manchester was the first to use the new atlas setup we were pointing to the CERN repository. The new setup has now become standard so I just had to remove the override variable CVMFS_SERVER_URL from atlas.cern.ch.local. The file is distributed by cfengine so I just changed it in cvs.

2) rpms upgrade: I had some initial difficulties because I was following the instructions for atlas T3, which normally also work for T2, and these suggested installing the cvmfs-auto-setup rpm. This rpm runs service cvmfs restartautofs, and the instructions also suggested rerunning it manually. On busy machines this causes the repositories to disappear and requires a service cvmfs restartclean, which wipes the cache and is not really recommended in production. In reality none of this is really necessary and a simple

yum -y update cvmfs cvmfs-init-scripts

is sufficient. I was able to add the rpm versions in cfengine and that was enough. The change from one version to another happens at the first unmount. Forcing this with a restartautofs is counterproductive (thanks to Ian for pointing this out).

Next week there should be a bug fix version that will take care of slow mount and some slow client tools routines on busy machines.

http://savannah.cern.ch/bugs/?86349

But since the upgrade procedure is so easy and the corrupted files problem

http://savannah.cern.ch/support/?122564

is fixed in cvmfs >2.0.2 I decided to upgrade anyway on Wednesday to avoid further errors in atlas and possibly lhcb.

NOTE: Of course I tested each step on a few nodes to check everything worked before rolling out with cfengine on all nodes. It is always good practice not to follow recipes blindly!
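
When testing on a few nodes, a quick way to confirm that a node is past the release affected by the corrupted files problem is to compare version strings. The helper below is a small sketch and not part of cvmfs; it relies on sort -V being available, and the installed version is hard-coded so that the example runs anywhere.

```shell
# version_gt A B: succeeds when version A is strictly newer than version B.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# On a worker node the installed version could come from, for example:
#   installed=$(rpm -q --qf '%{VERSION}' cvmfs)
installed="2.0.3"   # hard-coded here so the sketch runs anywhere

if version_gt "$installed" "2.0.2"; then
    status="fixed"
else
    status="needs-upgrade"
fi
echo "cvmfs $installed: $status"
```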

by Alessandra Forti (noreply@blogger.com) at September 09, 2011 01:06 PM

August 25, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Busy Disks

After checking a test 10 gig Disk Server deployment we uncovered an interesting pattern in storage network activity and how our 10 Gig switch copes with multiple connections at 10 Gigabit. The captures below were taken over a 5 minute window of operation and show just how bursty the traffic patterns from these devices can be.

The graphs show all interfaces on our Dell 8024F and the measurement window is in Mbps. The order is top to bottom with the initial capture at the top.




While the Disk servers have been hammering away, the intra-room round trip time between devices has been 0.40 msec on average, and the CPU on the core Dell seems more than happy to handle these loads as its utilisation is approximately 20% at present.

We are planning to enable QOS metrics on disk server traffic shortly to test the response times on QOS and Non-QOS disk servers.


by Mark Mitchell (noreply@blogger.com) at August 25, 2011 07:00 PM

News Flash from ScotGrid Labs

In my last post, we were investigating deployments of IPv6 on the test Cluster, the first of which was using SLAAC to assign addressing to hosts. Interestingly enough it worked, first time out the tin.

An IPv6 Traceroute from the web is shown below:

traceroute to 2001:630:40:ef0:230:48ff:fe5a:4b7 (2001:630:40:ef0:230:48ff:fe5a:4b7), 30 hops max, 40 byte packets
 1  2001:1af8:4200:b000::1 (2001:1af8:4200:b000::1)  1.600 ms  1.813 ms  1.882 ms
 2  2001:1af8:4100::5 (2001:1af8:4100::5)  1.320 ms  1.392 ms  1.465 ms
 3  be11.crs.evo.leaseweb.net (2001:1af8::9)  2.587 ms  2.631 ms  2.619 ms
 4  linx-gw1.ja.net (2001:7f8:4::312:1)  8.475 ms  8.466 ms  8.453 ms
 5  ae1.lond-sbr4.ja.net (2001:630:0:10::151)  78.338 ms  78.388 ms  78.376 ms
 6  2001:630:0:10::109 (2001:630:0:10::109)  9.900 ms  9.479 ms  9.446 ms
 7  so-5-0-0.warr-sbr1.ja.net (2001:630:0:10::36)  13.320 ms  13.196 ms  13.317 ms
 8  2001:630:0:10::296 (2001:630:0:10::296)  18.705 ms  18.542 ms  18.793 ms
 9  clydenet.glas-sbr1.ja.net (2001:630:0:8044::206)  18.947 ms  18.931 ms  18.948 ms
10  2001:630:42:0:3e::9a (2001:630:42:0:3e::9a)  19.434 ms !X  18.214 ms !X  17.682 ms !X
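
With SLAAC a host typically builds the lower 64 bits of its address from its MAC address (modified EUI-64): the universal/local bit of the first octet is flipped and ff:fe is inserted in the middle. The sketch below reproduces the interface identifier of the address traced above. The MAC address 00:30:48:5a:04:b7 is not stated in the post; it is inferred by reversing that procedure and should be read as an assumption.

```shell
# eui64_iid MAC: print the modified EUI-64 interface identifier for a MAC address.
eui64_iid() {
    # Split aa:bb:cc:dd:ee:ff into six hex octets ($1..$6).
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    # Flip the universal/local bit (0x02) of the first octet.
    first=$(( 0x$1 ^ 0x02 ))
    # Insert ff:fe in the middle and print four 16-bit groups without leading zeros.
    printf '%x:%x:%x:%x\n' \
        $(( (first << 8) | 0x$2 )) \
        $(( (0x$3 << 8) | 0xff )) \
        $(( 0xfe00 | 0x$4 )) \
        $(( (0x$5 << 8) | 0x$6 ))
}

# MAC inferred from the traced address (assumption), prefix taken from the traceroute.
iid=$(eui64_iid 00:30:48:5a:04:b7)
echo "2001:630:40:ef0:$iid"
```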


The next phase of testing will be to enable a webserver to speak both IPv4 and IPv6 using this access mechanism, and then to move on to Grid services.


I will post up a more detailed explanation of the mechanisms used for this soon.

by Mark Mitchell (noreply@blogger.com) at August 25, 2011 06:23 PM

August 23, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Night of the Return of the Living Worker Nodes

As Glasgow is currently being used as one of the sets for World War Z, we thought it only apt that we too resurrect the dead and get them to do our bidding. No, we haven't embraced "mad" science.

During the power work, we decided to alter the layout of 243d. Historically, the room had housed a mainframe including operators' booths. One of these booths still existed within 243d, so we took down one of the walls and added a new cabinet.

While the work was being conducted to remove the wall we covered the cluster and powered it off to minimise dust ingestion. If you wish to gift wrap a cluster we have plenty of experience in this field. However, our wrapping is limited to blue plastic presently.



After the wall had been removed, we cleared out the computer room and re-organised the storage cabinets, cabling and computing cabinets. In 243d there were a pile of 6 year old disused worker nodes and racked worker nodes whose PDU had been damaged during one of our many power cuts over the last 12 months. In addition to this we found and rebuilt a Dell Rack and also we had a spare Nortel 5510 switch.




With the newly available space from the removal of the wall in 243d, we got a tile cut and deployed the rack. The rack connects back to the older Stack01 via a copper gigabit Ethernet connection. This deployment will give us up to approximately 100 job slots once they are fully configured.




by Mark Mitchell (noreply@blogger.com) at August 23, 2011 05:04 PM

Two Stacks are better than one

Leading on from the last post, we have also re-introduced a new test cluster. This infrastructure is housed within the same rack as our old worker nodes but is completely independent of the production cluster. Supporting a Dell 8024F are 5 servers and a Dell 5000 series switch, which are connected via an independent 1 gigabit fibre connection to the University's network.

The purpose of this cluster is to test IPv4/IPv6 dual stack connectivity for grid Services, the testing of switch based security mechanisms and SL6 NAT testing without fear of impacting the real cluster.

The IPv6 connectivity model testing will be in multiple phases which include:

* SLAAC
* IPv6 to IPv4 tunneling
* IPv6 Routing


This framework is designed to comply with the HEPIX IPv6 Project and to look at the possible connection models required by Tier-2s to utilise IPv6. Additionally, we will be testing a wide variety of Grid enabled applications and associated systems such as Nagios to investigate potential issues within a dual stack deployment.

More on this soon.

by Mark Mitchell (noreply@blogger.com) at August 23, 2011 05:03 PM

August 12, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Running at capacity again


... after the shutdown. Slightly delayed due to coming back during a low point in Atlas work, which is now past us.

Here's a graph of data moved from our storage element, and you can probably pick out the rather subtle peak when the last batch of analysis traffic started (taking us up to capacity):


by Stuart Purdie (noreply@blogger.com) at August 12, 2011 05:22 PM

August 10, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Power startup, situation (hopefully) normal

The planned power work in the Kelvin Building was completed this morning and we have been transferred back to our proper power feed from the generators. The power startup went smoothly and the building has returned to normal.

The Scotgrid cluster was restarted after the power was seen to be stable and we came out of downtime at 2.20 pm. We will monitor our situation, but we hope that this power work will improve our stability over the coming months.

by David Crooks (noreply@blogger.com) at August 10, 2011 05:50 PM

August 03, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

And after studying its behaviour, objectively and critically, we believe we have a reliable method (With apologies to Neil Fallon)

Since the last post on the blog we have implemented a series of measures on the network which were planned to be deployed during the next Cluster refresh.

Primarily, we have migrated elements of our core servers such as svr020, svr001 and svr008 to the new Dell switch infrastructure and have introduced a series of Link Aggregation Groups (LAGs) across the Dell estate to raise their backbone to a full 20 Gigabits per second intra switch. This has led to the decommissioning of the core 10 Gigabit interconnect into our old Nortel gateway, stack01, and this has been replaced with another LAG between the Dells and stack01. The reason behind this will become clear in the next post.

The main upshot of this part of the network upgrade is that we now have greater control over the network services and monitoring running out of these servers, such as SNTP and Ganglia respectively. These can be fine tuned to a greater degree on the Dell environment to minimise the broadcast and Layer 2 multicast impact of these services.

However, that is not to say that the Nortels are on the way out quite yet. Our Torque and Maui Server, svr016, still resides on older Nortel equipment in Stack02, which is currently connected to the new Dell infrastructure by a 10 Gig fibre. This link is occasionally saturating, so we have decided to upgrade the link to 20 Gigabits by running a new multimode fibre between the two computer rooms, 141 and 243d. We also decided to implement Layer 2 QOS for svr016 to ensure that it got priority over all other cluster traffic within the stack and through the core network switches.

Therefore, we embarked on the re-configuration of the QOS parameters on Stack02. The complexity behind this lies not in the actual end configuration: effectively, the MAC address of svr016 is tracked across VLANs 1 and 2 to ensure that a Gold Quality of Service is met for any device wishing to speak to or be spoken to by svr016. The real complexity is implementing this without disabling the entire cluster attached to the network stack.

Earlier implementations of the Nortel OS had a nasty tendency to drop all non-specified traffic within the network, and the QOS policy generation, while incredibly granular in its ability to tag and filter traffic, involves 6 different stages to ensure that traffic is correctly tagged and forwarded.

Add to this the fact that if the MAC address does not have the correct MAC address mask, all traffic generated by svr016 will be dropped, effectively disabling the cluster for a period of time, and you get a general picture of the care required on our part to implement this feature.

Sam and I rechecked the configurations twice before attempting to implement them. However, when we attempted to commit, we discovered that the Nortel GUI is a lot more thorough in its checks than we could ever have imagined. Due to a misconfiguration of the MAC address mask, the system refused to commit it to the switches. It even supplied an error message which identified that the mask was wrong.

Once the mask had been corrected the configuration was loaded onto stack02 and immediately started to work. The image below shows the packet matching since the 30th of June 2011.




Now for the real test. How would it cope under increased DPM traffic loads?



Surprisingly well, it turns out, as all traffic to and from svr016 now has a low drop status and high precedence value across the network.

The images below show the system performance during one of these recent events.










As can be seen, there is no real increase in activity, as the QOS mappings for svr016 now mean that, while it is still part of the production and external VLANs, its traffic always travels 1st class.

The next phase of QOS development is to start to investigate the corralling of network broadcasts for services such as NFS to see if we can reduce the background chatter on the network without impacting service.









by Mark Mitchell (noreply@blogger.com) at August 03, 2011 10:21 PM

Circuits, Circuits everywhere but not a drop to switch

Since the late afternoon of the 26th of July we have been working to resume service on the Cluster at Glasgow.
We were put into unexpected downtime by our old friend; the power cut.

The root cause of this appears to be that the local mains supply into the site failed and was subsequently reinstated. However, we decided to wait until Wednesday morning to restart the cluster, to ensure that there was a clean and stable supply into the site. So off to the GOCDB, announce the unscheduled downtime and proceed.

While normally we would have immediately started on getting the cluster back online, as it turned out we couldn't have got ourselves back into production any sooner due to the residual issues caused by the power outage. As we have had several power interruptions at the site over the last 10 months, we have now got a reasonably robust restart procedure and we started this on Wednesday morning.

Initially, we had absolutely no issues surrounding the reset of both rooms, bar the loss of a rather expensive 10 Gig Ethernet interface on one of the new Dell Switches and the loss of the switch configuration files, which was caused by yours truly not running a copy run start on the switch after configuring a LAG group and QOS. We reconfigured the switch and all connectivity across the cluster was confirmed as good.

We then proceeded to rebuild one of our internal stacks to free up the 10 Gig interfaces on a Nortel 5530, which we had planned to move to our lower server room to build out the second 10 Gig link mentioned in a previous post. This too went surprisingly well, but then Dave and I had pretested building the stack, adding and removing devices, and inserting new base units on older test equipment.

We then retested again: stacking and LAGs were working fine, Spanning Tree was happy and the Cluster's network was in good shape. We then moved to phase 2 of the upgrade, which was to insert the 5530 switch into the switch stack in the downstairs server room. After we inserted the switch in the stack, it came up, the entire stack stabilised and then started to forward traffic.

However, about 3 minutes later we started to see the latency in the network rise and hosts fail to contact one another. Ping, SSH and normal cluster network traffic such as NFS, NTP and DNS also started to experience issues. We reduced the load on the network by detaching hosts from it, but to no avail. We then removed the 5530 from the stack but the problem remained. Over the next 4 hours we tried a variety of tests which all ended with either the dreaded Host Unreachable or 142 millisecond response times. To make matters worse (confusing), the switches were reporting an internal response time between rooms of 0.50 milliseconds via ping, but telnet and ssh between devices were also timing out.

As we were unable to ascertain the exact root cause, we called a break and went and got some air.

20 minutes and one pizza slice later, it occurred to me that if no device on the network was generating traffic at the volume required to produce a 94% packet loss scenario across multiple 10 Gig connections, then it had to be the network itself. Or rather what is attached to it.

The 10 Gig Interface being cooked wasn't the cause as it was dead at this point, but the power cut had left another present:

Damaged Ethernet Cables.

As the Cluster is too large to manually go round and check every cable individually with a line tester, we did something that I, as a former telco engineer, don't like doing. We rebooted the switches in numbered sequence. Starting with Stack01.

The purpose of this test is to isolate as quickly as possible the damaged cable, device or interface by pinging across the cluster from one room to another and intra switch if need be.

So Ping from Svr001 (upstairs) to Node141 (downstairs).
Destination Host Unreachable.
Leave the ping running.
Reboot Stack 01.
Ping response time of 0.056 milliseconds
Stack01 reloads.
Destination Host Unreachable.

We repeated this test twice. And got the same result.
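
Leaving a ping running and watching for the moment it changes state is easy to script. The following is a minimal sketch, not the procedure we actually used: it reads ping output and prints a timestamped line whenever the link goes up or down. The message patterns are those of the common Linux ping and the sample addresses are made up, so both are assumptions to adapt.

```shell
# classify LINE: print "up" for an echo reply, "down" for an unreachable or timeout line.
classify() {
    case "$1" in
        *"bytes from"*) echo up ;;
        *"Unreachable"*|*"timed out"*|*"timeout"*) echo down ;;
        *) echo unknown ;;
    esac
}

# report_changes: read ping output on stdin and print one line per state change.
report_changes() {
    last=""
    while IFS= read -r line; do
        state=$(classify "$line")
        [ "$state" = unknown ] && continue
        if [ "$state" != "$last" ]; then
            echo "$(date '+%H:%M:%S') link is $state"
            last=$state
        fi
    done
}

# Canned output (hypothetical addresses) so the sketch runs without a network.
sample='From 10.0.0.1 icmp_seq=1 Destination Host Unreachable
From 10.0.0.1 icmp_seq=2 Destination Host Unreachable
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.056 ms
From 10.0.0.1 icmp_seq=4 Destination Host Unreachable'

# Strip the timestamps so the sequence of transitions is easy to compare.
transitions=$(printf '%s\n' "$sample" | report_changes | sed 's/^[0-9:]* //')
echo "$transitions"
```

On a live network the input would come from something like ping node141 | report_changes, started before the stack is rebooted.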

So onto Stack01. The partner switch which trunks into this stack to effect an uplink onto the core of our network did not report any errors on the multi-link trunk, but also showed very little traffic. Neither did Stack01, until I tried to ping its loopback address from the partner switch. The error rate on the interfaces increased and CRC errors were recorded. So we systematically disabled the multi-link trunk link by link until the stack interconnect stabilised.

This reduced the trunk's capacity substantially but it also stabilised the network. So we added the 5530 back into the Stack downstairs, turned on the partner ports upstairs and were rewarded with a 20 Gig backbone which is now operational at the Glasgow site.

As for the old LAG connection it was stripped out completely this morning and by early afternoon we had re-instated a 6 Gig connection to Stack01 which is working happily. From here we brought the site out of downtime and are back on the Grid.

We are putting in place an internal tftp process for backing up switch configurations each night.

The main lesson from this is that on a large layer 2 environment, the smallest issue can become a major one and plans are well advanced on the next set of configuration changes to the network at Glasgow, to get around this and other potential issues in the future.






by Mark Mitchell (noreply@blogger.com) at August 03, 2011 10:20 PM

Side Effects may include ...

On Wednesday, the 25th, the Glasgow Scotgrid site was part of the wider SSC5 Security Challenge and during the course of the challenge we encountered several issues with the network security configuration on our core switch.

The configuration changes which caused issues are specifically:
1) Access List Configuration for inbound services
2) ICMP dos-control settings

The Access List Configuration (ACL) did not accept a global default permit with a wild card mask for both IP address ranges and subnets. The key issue here is that when the Access List was applied on an access port for inbound traffic, it worked correctly. However, when applied to the primary egress port onto our network switch, it disabled remote connectivity into the cluster, while not impacting internal machine-to-machine traffic on the cluster. The access list was removed and remote access was restored. The root cause for this failure was traced to an incorrectly set ACL ANY permit within the list; however, on further investigation, each network requiring access to and from the cluster will require its own unique entry rather than a default network range with a series of denied services. The central IT group at the University also runs a series of access lists and firewalls within the edge routing and switching network to the JANET environment, which can be adapted to fit our requirements within the cluster setup at Glasgow.

A secondary issue:

A dos-control setting which controls the maximum payload for ICMP also caused unusual network behaviour after it was implemented. Limiting the payload to 512 bytes caused Maui and Torque to encounter issues when attempting to communicate with one another, which then impacted other services within the cluster environment. While this slowed down Torque and Maui, it did not completely stop the cluster; however, its removal immediately improved data connectivity within the cluster. This issue is being referred back to the manufacturer, as the configurable payload limit currently only goes up to 1023 bytes.
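
One way to see where such a limit bites is to send pings with payload sizes just below, at and just above it. The sketch below only prints the commands it would run; it assumes the switch counts the ICMP data bytes in the same way as the -s option of ping, which the post does not confirm, and it uses svr016 purely as an example target.

```shell
# probe_sizes LIMIT: print payload sizes just below, at and just above LIMIT.
probe_sizes() {
    echo "$(( $1 - 1 )) $1 $(( $1 + 1 ))"
}

target=svr016   # host named in the post; any reachable cluster host would do
run=echo        # prints the commands; set run= (empty) to actually execute them

# Probe around the 512 byte setting and around the 1023 byte maximum.
plan=$(for size in $(probe_sizes 512) $(probe_sizes 1023); do
    $run ping -c 3 -s "$size" "$target"
done)
echo "$plan"
```

If replies stop exactly one byte above the configured value, the dos-control setting rather than the hosts is the likely culprit.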

Once we have an update on this issue we will post it up on the blog.

by Mark Mitchell (noreply@blogger.com) at August 03, 2011 10:19 PM

A switch port too far

As part of the ongoing upgrades surrounding the recent issues that the CEs have had when communicating with svr016, we decided to upgrade the core backbone link to 20 Gigabits. Presently, we have one 10 Gigabit trunk link between 141 and 243d, which is occasionally saturating with traffic.

As previously posted, we disabled the 10 gigabit link into Stack01 and used the XFP GBIC recovered from it to facilitate this new link. Sam and I laid new fiber optic patch leads in both rooms to the patch panels and connected these to spare ports on the Core Dell 8024F and Stack02's 5530.

However, the link refused to come up. After several hours investigation we acquired a fiber optic line tester which proved that light was coming through the new link. We then tested the ports on both switches with a fiber optic loop.

While the port and GBIC in the 8024F looped correctly (you get a rather reassuring green link light on the transmit and receive port), the loop failed on the port in Stack02. We retested the XFP in its old unit, stack01, and it came up correctly using the loop.

We are using 62.5 um patch leads which, under the standards, can't be driven as far as 50 um ones, and we thought this might have been the issue. However, we confirmed that this wasn't the case by re-testing all the components end to end with the fiber optic meter.

We cleaned out the interface slot on the Stack02 5530 with compressed air and isopropyl alcohol, but the port, while recognising the GBIC correctly, did not bring up the link. We fear that the on-board optical interface is damaged; however, we would need to put the site into downtime to confirm this, so we have come up with a Plan B.

As we have successfully built a LAG between 141 and 243d which is in place and did not impact service at all during its commissioning, and have laid in the fiber interconnect,  we have decided to investigate moving our second 5530 into Stack02 from Stack01 to give us the 20 Gigabit uplink that we require within the core of the network.

More on this after the move. 

As an aside, you never know how windy cold aisles are, until you lift a floor tile. Sam is on the floor in this image and not glued to the ceiling as his hair direction may imply.










by Mark Mitchell (noreply@blogger.com) at August 03, 2011 10:19 PM

Controlled Shut Down. Please standby.

As many regular readers of our blog may have noticed, we have had several power cuts over the last 8 months. While the ScotGrid Glasgow cluster has survived these interruptions relatively well, the School of Physics and Astronomy at the University of Glasgow has undertaken a piece of work to resolve this recurrent issue.

Therefore, on the 7th - 10th of August we will be going into a controlled downtime period so that the transformers which supply the mains feed into our site can be removed and upgraded.

We should be back in action on the morning of Wednesday the 10th.

by Mark Mitchell (noreply@blogger.com) at August 03, 2011 10:17 PM

July 14, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

We make knowledge possible

Just a quick Blog post regarding the WLCG workshop held at DESY in Hamburg from the 11th to the 13th of July.
The various presentations covered aspects of all the experiments and the future requirements for systems, storage, monitoring and networks.
Links to the workshop agenda and content can be found here:
https://indico.cern.ch/conferenceDisplay.py?confId=124407

by Mark Mitchell (noreply@blogger.com) at July 14, 2011 04:23 PM

July 13, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

cvmfs installation

Last week, after a few months' delay, I finally installed cvmfs. I have advocated the use of a shared file system for the input sandbox with locally cached data since 2002-2003. AFS was successfully used in grid and non-grid environments by BaBar users and is still used by local non-LHC users in Manchester for small work. So I'm pretty happy that a lightweight caching file system is now available for more robust traffic. This is a really good moment to install cvmfs for two reasons:

1) Lhcb asked for it too.
2) Atlas has moved its condb files from the HOTDISK space token to cvmfs.

And it should drastically reduce errors for both NFS and SE load.

These are my installation notes:

* Install cernvm.repo: you can find it here, or you can copy the rpms to your local repository and install from there. I distribute the file with cfengine, but otherwise:

cd /etc/yum.repos.d/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo


* Install the gpg key: yum didn't like the key and was giving errors. I don't know if the problem is only mine (possible). I told the developers anyway, and in the meantime I had to remove the key check from the repo file and trust the rpms. But if you want to try it, it might work for you:

cd /etc/pki/rpm-gpg/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM


* Install the rpms. In the documents there is an additional rpm, cvmfs-auto-setup, which is not really necessary and was also causing problems due to some migration lines devised for upgrades. Other than that, it runs a setup and a restart command that can be run by your configuration tool of choice. S. Traylen also suggested installing SL_no_colorls to avoid ls /cvmfs mounting all the file systems, which is why it's in the list.

yum install -y fuse cvmfs-keys cvmfs cvmfs-init-scripts SL_no_colorls

* Install configuration files. Below is what I added. For atlas there is a nightlies repository in the docs, but that's not ready yet and isn't going to work. The default QUOTA_LIMIT set in default.local can be overridden in the experiment configuration. For each of these files there is a .conf file and a .local file; you should edit only the .local. If they are not there, just create them.
You need to override the CVMFS_SERVER_URL for atlas, otherwise you don't get the new setup. In cern.ch.local I simply inverted the order of the servers to get RAL first and then the other two if RAL fails. I also removed CERNVM_SERVER_URL, which appears in cern.ch.conf, otherwise it goes to CERN first even though it's not apparently defined anywhere.

/etc/cvmfs/default.local
CVMFS_REPOSITORIES=atlas,atlas-condb,lhcb
CVMFS_CACHE_BASE=/scratch/var/cache/cvmfs2
CVMFS_QUOTA_LIMIT=2000
CVMFS_HTTP_PROXY="http://[YOUR-SQUID-CACHE]:3128"

/etc/cvmfs/config.d/atlas.cern.ch.local
CVMFS_QUOTA_LIMIT=10000
CVMFS_SERVER_URL=http://cvmfs-stratum-one.cern.ch/opt/atlas-newns

/etc/cvmfs/config.d/lhcb.cern.ch.local
CVMFS_QUOTA_LIMIT=5000

/etc/cvmfs/domain.d/cern.ch.local
CVMFS_SERVER_URL="http://cernvmfs.gridpp.rl.ac.uk/opt/@org@;http://cvmfs-stratum-one.cern.ch/opt/@org@;http://cvmfs.racf.bnl.gov/opt/@org@"
CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/cern.ch.pub
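
Since these files use shell variable assignment syntax, a typo such as an unbalanced quote can be caught before distribution by sourcing the file in a subshell. The sketch below stages default.local in a scratch directory rather than /etc/cvmfs; the squid host name is a hypothetical placeholder for your own cache.

```shell
# Stage the file in a scratch directory; on a real node the target is /etc/cvmfs.
stage=$(mktemp -d)
squid_host="squid.example.org"   # hypothetical; replace with your squid cache

cat > "$stage/default.local" <<EOF
CVMFS_REPOSITORIES=atlas,atlas-condb,lhcb
CVMFS_CACHE_BASE=/scratch/var/cache/cvmfs2
CVMFS_QUOTA_LIMIT=2000
CVMFS_HTTP_PROXY="http://${squid_host}:3128"
EOF

# Source the file in a subshell to check that it parses, then print two key values.
summary=$(. "$stage/default.local" && echo "$CVMFS_QUOTA_LIMIT $CVMFS_HTTP_PROXY")
echo "$summary"
```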


* Create the cache space. By default it's in /var/cache. However I moved it to the /scratch partition which is bigger.

mkdir -p /scratch/var/cache/cvmfs2
chown cvmfs:cvmfs /scratch/var/cache/cvmfs2
chmod 2755 /scratch/var/cache/cvmfs2
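
Before settling on the quotas it is worth checking that the cache partition can actually hold them. The sketch below assumes that the quota values are in megabytes and that each repository gets its own quota (10000 for atlas, 5000 for lhcb and the default 2000 for atlas-condb), which matches the configuration above but is not verified here. It checks /tmp so that it runs anywhere; on a worker node cache_dir would be the /scratch cache directory.

```shell
cache_dir=/tmp    # /scratch/var/cache/cvmfs2 on the nodes described above
required_mb=$(( 10000 + 5000 + 2000 ))   # atlas + lhcb + atlas-condb (default quota)

# df -Pk prints one header line and one data line; column 4 is the free space in KB.
avail_kb=$(df -Pk "$cache_dir" | awk 'NR == 2 { print $4 }')
avail_mb=$(( avail_kb / 1024 ))

if [ "$avail_mb" -ge "$required_mb" ]; then
    verdict=ok
else
    verdict=too-small
fi
echo "need ${required_mb} MB, have ${avail_mb} MB on $cache_dir: $verdict"
```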


* Run the setup. These are the commands the cvmfs-auto-setup would run at installation time. They also configure fuse although that's only one line added to fuse.conf.

/usr/bin/cvmfs_config setup
service cvmfs restartautofs

chkconfig cvmfs on
service cvmfs restart


* Some parameters need to change for squid. Below is what the documentation suggests. I tuned it to the size of my machine: for example, the maximum_object_size and cache_mem were too big, and I checked which other parameters were already set to decide whether they needed changing.

collapsed_forwarding on
max_filedesc 8192
maximum_object_size 4096 MB
cache_mem 4096 MB
maximum_object_size_in_memory 32 KB
cache_dir ufs /var/spool/squid 50000 16 256


* Apply changes for Lhcb: VO_LHCB_SW_DIR needs to point to cvmfs. You can change it in YAIM and rerun it, or you can do as I've done (still making sure to change YAIM so that freshly installed nodes don't need this hack). With this change Lhcb is good to go.

sed -i.sed.bak 's%/nfs/lhcb%/cvmfs/lhcb.cern.ch%' /etc/profile.d/grid-env.sh
mv /etc/profile.d/grid-env.sh.sed.bak /root


* Apply changes for Atlas. A similar change to VO_ATLAS_SW_DIR is required and you need to set an additional variable that is not handled by YAIM. For now I added it to grid-env.sh, but it would be better placed in another file not touched by YAIM, or a snippet should be added to YAIM to handle the variable. This is enough for the jobs to start using the software area. However, you still have to contact the atlas sw team to do their validation tests and enable the condb use. They'll propose a long way and a short way. I took the short one because I didn't want to go into downtime and jobs were already running using the new setup.

sed -i.sed.2 's%"/nfs/atlas"%"/cvmfs/atlas.cern.ch/repo/sw"\ngridenv_set "ATLAS_LOCAL_AREA" "/nfs/atlas/local"%' /etc/profile.d/grid-env.sh
mv /etc/profile.d/grid-env.sh.sed.2 /root


* Still for Atlas, remove some installed .conf files which install a link in /opt that is no longer necessary. The second file might not exist, but there is an atlas-nightly.cern.ch.conf. This will surely change in future cvmfs releases.

service cvmfs stop
rm /etc/cvmfs/config.d/atlas.cern.ch.conf
rm /etc/cvmfs/config.d/atlas-condb.cern.ch.conf
service cvmfs start


Update 12/7/2011: Using YAIM

cfengine only installs the rpms and the configuration files (*.local). All the rest is now carried out by a YAIM function I created (config_cvmfs). I put a tar file here. To make it work I also added a node description in node-info.d/cvmfs (also in the tar file) that contains it. In this way I don't have to touch any existing YAIM files and I can just add -n CVMFS to the YAIM command line we use to configure the WNs. It requires the ATLAS_LOCAL_AREA and CVMFS_CACHE_DIR variables to be set in your site-info.def.

CVMFS docs are here

Release Notes
Init Scripts Overview
Examples
Technical Report
RAL T1
Atlas T2/T3 setup
Atlas latest changes

by Alessandra Forti (noreply@blogger.com) at July 13, 2011 11:10 PM

July 12, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

How to remove apel warnings and avoid nagios alerts

Quite a few sites have a few entries in APEL that don't quite match. They can appear with one of two messages

OK [ Minor discrepancy in even numbers ]
WARN [ Missing data detected ]


They don't look good on the Sync page and nagios also sends alerts for this problem which is even more annoying.

The problem is caused by a few records with the wrong time stamp (StartTime=01-01-1970). These records need to be deleted from the local database and the period where they appear republished with the gap publisher. To delete the records connect to your local APEL mysql and run:

mysql> delete from LcgRecords where StartTimeEpoch = 0;

Then, for each month where the entries appear, rerun the gap publisher. Finally, rerun the publisher in missing records mode to update the SYNC page, or wait for the next proper run if you are not impatient.

Thanks to Cristina for this useful tip she gave me in this ticket.

by Alessandra Forti (noreply@blogger.com) at July 12, 2011 12:46 AM

July 11, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Everyone's doing a brand new filesystem now: Come on, baby, do the cvmfs now.

Ever since I heard about it at CHEP 2010, I've been itching to get CVMFS set up at Glasgow, because it was so clearly a better solution for software provision than the old sgm-role / NFS-mounted area approach.
Concerns about the reliability of the hardware that the service was running on (it may still not be on production hardware at CERN as I write this) always held the more sensible minds here back, but now that it's all up and working at RAL, and RAL is providing a stratum-1 cache as a backup, there's nothing stopping us.

So, following a combination of Ian Collier's description of the set-up at RAL and the official CernVMFS technical report (pdf), with some adjustments to make changes to our Cfengine config, I spent some of last week getting cvmfs working on the cluster.

For your edification, this is what I did:

1) First, set up the new repository you need. In our case, yum repositories (and gpg keys) are managed by cfengine, so, in our cfengine skel directory for the worker nodes, I added:

wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo -P ./skel/workers/etc/yum.repos.d/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM -P ./skel/workers/etc/pki/rpm-gpg/


2) Fuse and cvmfs both want to have user and group entries created for them. We manage users and groups with cfengine, so I added a fuse group to /etc/group and a cvmfs user and group. The cvmfs user also needs to be added as a member of the fuse group.

3) Now that the initial set-up bits are done, the new packages can be installed, again, using cfengine. I added the packages
fuse ; fuse-libs ; cvmfs ; cvmfs-keys ; cvmfs-init-scripts

to the default packages for our worker node class in cfengine.

4) Editing configuration files.
You need to edit auto.master to get autofs to support cvmfs. Just add a line like

/cvmfs /etc/auto.cvmfs

as the auto.cvmfs map is added by the cvmfs rpm. Remember to issue a:

service autofs reload

afterwards, or get your configuration management system to do so automagically for you.

You also need to configure fuse to allow users to access things as other users:
/etc/fuse.conf
user_allow_other
And finally, you need to actually configure cvmfs itself. Cvmfs uses 2 main configuration files:
default.local, which specifies modifications of the default settings for the local install
cern.ch.local, which specifies modifications of the default server to use for *.cern.ch repositories.

/etc/cvmfs/default.local needs to be configured for:


CVMFS_USER=cvmfs
CVMFS_NFILES=32768
#CVMFS_DEBUGLOG=/tmp/cvmfs.log
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,lhcb.cern.ch,cms.cern.ch,geant4.cern.ch,sft.cern.ch
CVMFS_CACHE_BASE=/tmp/cache/cvmfs2/
CVMFS_QUOTA_LIMIT=10000
CVMFS_HTTP_PROXY="nameoflocalsquid1|nameoflocalsquid2"


/etc/cvmfs/cern.ch.local, for UK sites should probably be configured as:


CVMFS_SERVER_URL="http://cernvmfs.gridpp.rl.ac.uk/opt/@org@;http://cvmfs-stratum-one.cern.ch/opt/@org@"


(since RAL is closer to us than CERN).

A brief note: ';' in a list of options specifies failover, and '|' load-balancing. So "foo;bar" means "try foo, then bar", while "foo|bar;baz" means "try to load-balance queries between foo and bar, if that fails, try baz". This works for the squid proxy specifiers in default.local and also the server destinations in cern.ch.local .
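
To make the ';' and '|' semantics concrete, here is a minimal Python sketch of how such a string breaks down. It is not how cvmfs itself is implemented; the function names are made up for illustration, and shuffling within a group is just one assumed way of spreading load.

```python
import random


def parse_spec(spec):
    """Split 'a|b;c' into failover groups: [['a', 'b'], ['c']]."""
    return [group.split("|") for group in spec.split(";") if group]


def try_order(spec, rng=random):
    """Return one order in which hosts could be tried.

    Groups (separated by ';') are tried strictly in sequence; hosts inside a
    group (separated by '|') are shuffled to spread the load.
    """
    order = []
    for group in parse_spec(spec):
        shuffled = list(group)
        rng.shuffle(shuffled)
        order.extend(shuffled)
    return order


print(parse_spec("foo;bar"))
print(parse_spec("foo|bar;baz"))
print(try_order("foo|bar;baz"))   # baz only ever appears after foo and bar
```

The same reading applies to both CVMFS_HTTP_PROXY and CVMFS_SERVER_URL.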

Another note: the cache directory specified in default.local should be large enough to actually cache a useful amount of data on each worker node. 10 GB per VO is reported to be comfortably enough for atlas and lhcb, and is therefore probably wildly exorbitant for any other VO that would be using it. I've tested, and you can happily set this directory to be readable only by the cvmfs user, which gives you a tiny bit more security.

If you change the configuration files for cvmfs, you need to get it to reload them, like autofs.

service cvmfs reload

seems to work fine (and our cfengine config now does this if it has to update those config files).

In our case, I created the two config files, stuck them in the skel directories for worker nodes in cfengine, and added them to the list of files that are expected to be on worker nodes in the config.



5) You can check that all this is working by trying a service cvmfs probe
or explicitly mounting a cvmfs path somewhere outside of automount's config.
With the default config, atlas software is at /cvmfs/atlas.cern.ch and so on.

by Sam Skipsey (noreply@blogger.com) at July 11, 2011 03:40 PM

July 01, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

The Grid is a hungry, hungry beast....

... and it eats networks. From here begins a long, convoluted story, ending, as these often do, in something that seems like it should have been obvious.

We've been noticing some 'blips', during which Maui fights bravely but ultimately fails to schedule jobs. This is generally considered rather sub-optimal.

The root of it was Maui was failing with an error:

ERROR:    cannot get node info: Premature end of message


That Maui error results in Maui taking a break for 15 minutes, before trying to schedule anything again. Which is fair enough, in the face of communication errors. Only ... Maui doesn't speak to anything except the Torque server. Which is running on the same host.

So what's actually happening here is that Torque can't talk to some node or other, and reporting that to Maui, which is then breaking. It didn't seem right that a communication failure to a single node once should stop jobs from starting elsewhere, which prompted some deeper investigation.

Looking for obvious correlations, we noticed that the scheduling blips happened right when we're running lots of analysis jobs - exactly when we don't want scheduler blips! However, it wasn't an obvious correlation, in that sometimes running 1000 jobs at once was fine, other times 400 caused things to gum up.

More worrisome than sub-optimal scheduling was that during the same time period we got occasional errors from the CEs, of the form:

BLAH error: submission command failed (exit code = 1) 
(stdout:) 
(stderr:pbs_iff: cannot read reply from 
pbs_server-No Permission.-qsub: 
cannot connect to server svr016.gla.scotgrid.ac.uk 
(errno=15007) Unauthorized Request


Dissecting that down, the BLAH part is CREAM saying it can't submit the job, so we're looking at the pbs_iff part. The purpose of pbs_iff is to authenticate the current user to the Torque server, so that the job is run with the correct user id (and can be checked with the ACL's on the server, if appropriate). The next part with qsub is just reporting that it's not able to talk to the server.

The root problem is pbs_iff not able to communicate, after which the rest of the qsub is failing for lack of authentication. This is a problem, because these are jobs that are already accepted by the CREAM CE, and shouldn't be failed here. (If a site can't cope with the jobs, the CE should be disabled, so it never accepts the jobs - that's the signal to the submitter/WMS to try elsewhere.)

How does all this link back to the network issues? Well, our cluster is split into two rooms, linked by a couple of fibres.

During analysis, we can see 2 GB per second (yes, that's in bytes) in traffic leaving the disk servers. Roughly half the disk and about half of the CPUs [see later!] are in each room; that implies that given a random distribution half that traffic has to pass through the fibre link.

And, yep, that's the problem right there. The Torque server is unable to shout loud enough to talk to the nodes when the link is full, or to be heard by some of the CEs. Digging into the stats shows that the link is running at 83% average utilisation over the past month. So when analysis hits, it wipes out any other traffic.

For the moment, then, I've put a cap on the number of analysis jobs until we can resolve this, as mitigation. And sent Mark off to find some more fibre and ports on the switches!

Some interesting sums: it turns out we have nearer 1/3 of the CPU upstairs, and 2/3 (1200 job slots) downstairs. Disk is close to 1/2 each. Matching this up with the planning number of 5 MB per second 'disk spindle to analysis cpu' bandwidth suggests that we need 3 GB per second, or 24 Gb/s, of bandwidth between the rooms to run at full capacity, compared to 10 Gb/s at the moment.
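
One way to reproduce those sums is sketched below in Python. The 600 upstairs slots are inferred from the 1/3 to 2/3 split rather than stated, and the assumption that the two directions of the link do not compete with each other is mine; treat it as a back-of-the-envelope check, not a capacity model.

```python
MB_PER_SLOT = 5             # planning figure: MB/s from disk to each analysis job
SLOTS_DOWNSTAIRS = 1200     # 2/3 of the CPU (from the text)
SLOTS_UPSTAIRS = 600        # assumed: the remaining 1/3
REMOTE_DISK_FRACTION = 0.5  # disk is split roughly half and half


def cross_room_mb_per_s(slots_in_room):
    """Traffic pulled across the link by the jobs running in one room."""
    return slots_in_room * MB_PER_SLOT * REMOTE_DISK_FRACTION


to_downstairs = cross_room_mb_per_s(SLOTS_DOWNSTAIRS)
to_upstairs = cross_room_mb_per_s(SLOTS_UPSTAIRS)

# The two flows travel in opposite directions, so the busier one sets the size.
peak_mb_per_s = max(to_downstairs, to_upstairs)
peak_gbit_per_s = peak_mb_per_s * 8 / 1000

print(f"{peak_mb_per_s / 1000:.1f} GB/s = {peak_gbit_per_s:.0f} Gb/s needed")
```

With these inputs the busier direction is the one feeding the downstairs job slots, which is where the 3 GB per second figure comes from.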

Hrm. No wonder we were having difficulty! On the other hand, it's probably been this link that's the limiting factor in our analysis throughput, so we should be able to roughly double our peak throughput of analysis jobs once that link is upgraded.

That, and not have the scheduler taking a wee nap during peak times.

by Stuart Purdie (noreply@blogger.com) at July 01, 2011 05:02 PM

June 23, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

DPM optimization next round

After I applied 3 of the mysql parameter changes I talk about in this post, I didn't see the improvement I was hoping for in the atlas job time outs.

This is another set of optimizations I put together after further searching.

First of all I started to systematically count the TIME_WAIT connections every five minutes. I also correlated them, in the same log file, to the number of concurrent threads the server keeps, mostly in sleep mode. You can get the latter by running mysqladmin -p proc stat or from within a mysql command line. The number of threads was near the default maximum allowed in mysql, so I doubled that in my.cnf

max_connections=200

then I halved the kernel time out for TIME_WAIT connections

sysctl -w net.ipv4.tcp_fin_timeout=30

the default value is 60 sec. If you add it to /etc/sysctl.conf it becomes permanent.
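
The post doesn't say how the TIME_WAIT connections were counted. One possible way on Linux, sketched here in Python, is to read /proc/net/tcp, where TIME_WAIT sockets carry state code 06. The sample lines are made up so that the snippet runs anywhere.

```python
TIME_WAIT = "06"   # state code used in /proc/net/tcp on Linux


def count_time_wait(lines):
    """Count sockets in TIME_WAIT in text formatted like /proc/net/tcp."""
    count = 0
    for line in lines[1:]:            # skip the header line
        fields = line.split()
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


# Made-up sample in the /proc/net/tcp layout: sl, local, remote, state, ...
sample = [
    "  sl  local_address rem_address   st tx_queue rx_queue",
    "   0: 0100007F:0CEA 0100007F:D431 06 00000000:00000000",
    "   1: 0100007F:0CEA 0100007F:D432 01 00000000:00000000",
    "   2: 0100007F:0CEA 0100007F:D433 06 00000000:00000000",
]
print(count_time_wait(sample))

# On a live Linux host one would pass the real file instead:
# with open("/proc/net/tcp") as f:
#     print(count_time_wait(f.readlines()))
```

Run from cron every five minutes and appended to a log, this gives the kind of time series described above.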

Finally I found this article which explicitly talks about mysql tunings to reduce connection timeouts: Mysql Connection Timeouts and I set the following

sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.core.somaxconn=512


again add to /etc/sysctl.conf to make it permanent; and added in my.cnf

back_log=500

I calculated my numbers on 500 connections/s because that's what I observed when I did all this (I have observed even larger numbers). Admittedly they are now stable at 330 connections per second, but we haven't had any heavy ramp up since Saturday, only a mild one which didn't cause any time outs. I'm waiting for a serious ramp as the definitive test. That said, since Saturday we haven't seen any timeout errors, not even the low background that was always present. So there is already an improvement.

Update 16/06/2011

Today there was an atlas ramp from almost 0 to >1400 jobs and no time outs so far.

A few timeouts were seen yesterday, but they were due to authentication between the head node and a couple of data servers, which I will have to investigate. They are only a handful, nowhere near the scale observed before, and not due to mysql. I will still keep things under observation for a while longer, just in case.

by Alessandra Forti (noreply@blogger.com) at June 23, 2011 11:22 AM

June 11, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

DPM Optimization

My quest to optimize DPM continues. Bottlenecks are like Russian dolls and hide behind each other. After optimizing the data servers (increasing the block device read ahead, enabling lacp on network channel bonding and multiplying the atlas hotdisk files) there is still a problem with mysql on the head node which causes time outs.

When atlas ramps up there is often an increase of connections in TIME_WAIT. I observed >2600 at times. The mysql database becomes completely unresponsive and causes the time outs. Restarting the database causes the connections to finally close and the database to resume normal activity. Although a restart might alleviate the problem, as usual it's not a cure. So I went on a quest. What follows might not alleviate my specific problem (I haven't tested it in production yet), but it certainly helps with another: DB reload.

Sam already wrote some performance tuning tips here: Performance and Tuning most notably the setting of innodb_buffer_pool_size. After a discussion on the DPM user forum and some testing this is what I'd add:

I set "DPM REQCLEAN 3m" when I upgraded to DPM 1.7.4 and this, after a reload, has reduced Manchester DB file size from 17GB to 7.6GB. Dumping the db took 7m34s. I then reloaded it with different combinations of suggested my.cnf innodb parameters and the effects of some of them are dramatic.

The default parameters should definitely be avoided. Reloading a database with the default parameters takes several hours. Last time it took 17-18 hours, this time I interrupted after 4.

With a combination of the parameters suggested by Maarten the time is drastically reduced. In particular the most effective have been setting innodb_buffer_pool_size and innodb_log_file_size. Below are the results of the reload tests I made, in decreasing order of time. I then followed Jean-Philippe's suggestion to drop the requests tables. Dropping the tables took several minutes and it was slightly faster with a single db file. After I dropped the tables and the indexes, ibdata1 size dropped to 1.2GB and, using combination 4 below, it took 1m7s to dump and 5m7s to reload. With the one-file-per-table configuration reloading was slightly faster, but after I dropped the requests tables there was no difference. It is also balanced by the fact that deletion seems slower, and the effects are probably more visible when the database is bigger, so these small tests don't give any compelling reason either for or against it for now.

These are the steps that help reduce the time it takes to reload the database:

1) Enable REQCLEAN in shift.conf (I set it to 3 months to comply with security requirements).
2) Set innodb_buffer_pool_size in my.cnf (I set it to 10% of the machine memory and couldn't see much difference when I eventually set it to 22.5%, but in production it might be another story with repeated queries for the same input files).
3) Set innodb_log_file_size in my.cnf (I didn't give much thought to this; Maarten's value of 50MB seemed good enough. Binary log files need to be removed and the database restarted to enable this, but check the docs: this might not be a valid strategy if you make heavier use of the binary logs).
4) Set innodb_flush_log_at_trx_commit = 2 in my.cnf (although this parameter seems less effective during reload, it might be useful in production; 2 is slightly safer than 0).
5) Use the script Jean-Philippe gave me to drop the requests tables before an upgrade.

Hopefully they will also help stop the time outs.

Tests:

COMBINATION 1

innodb_buffer_pool_size = 400MB
# innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2
# innodb_file_per_table

real 167m30.226s
user 1m41.860s
sys 0m9.987s

============================
COMBINATION 2
innodb_buffer_pool_size = 900MB
# innodb_log_file_size = 50MB
# innodb_flush_log_at_trx_commit = 2
# innodb_file_per_table

real 155m2.996s
user 1m40.843s
sys 0m9.935s

===========================
COMBINATION 3
innodb_buffer_pool_size = 900MB
innodb_log_file_size = 50MB
# innodb_flush_log_at_trx_commit = 2
# innodb_file_per_table

real 49m2.683s
user 1m39.137s
sys 0m9.902s
===========================
COMBINATION 4
innodb_buffer_pool_size = 400MB
innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2 -- test also with 0 instead of 2 but it didn't change the time it took and 2 is slightly safer
# innodb_file_per_table

real 48m32.398s
user 1m40.638s
sys 0m9.733s
===========================
COMBINATION 5
innodb_buffer_pool_size = 900MB
innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table

real 47m25.109s
user 1m39.230s
sys 0m9.985s
===========================
COMBINATION 6
innodb_buffer_pool_size = 400MB
innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table

real 46m46.850s
user 1m40.378s
sys 0m9.950s
===========================
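
To compare the six runs at a glance, the "real" times can be converted to seconds and expressed relative to combination 1. The Python below is just a convenience sketch; the values are copied from the results above and nothing else is assumed.

```python
import re

# "real" times copied from the six test runs above.
REAL_TIMES = {
    1: "167m30.226s",
    2: "155m2.996s",
    3: "49m2.683s",
    4: "48m32.398s",
    5: "47m25.109s",
    6: "46m46.850s",
}


def to_seconds(stamp):
    """Convert a `time` value such as '49m2.683s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


baseline = to_seconds(REAL_TIMES[1])
for combo, stamp in REAL_TIMES.items():
    secs = to_seconds(stamp)
    print(f"combination {combo}: {secs / 60:6.1f} min, "
          f"{baseline / secs:.2f}x faster than combination 1")
```

The big step is between combinations 2 and 3, which differ only in innodb_log_file_size being set.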

by Alessandra Forti (noreply@blogger.com) at June 11, 2011 05:51 PM

June 01, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

EGEE to EGI

We were recently asked to make sure that we were tagging our site as belonging to EGI rather than EGEE, since the latter project ended some time ago. This would typically involve changing a line entry in our site-info.def file and rerunning YAIM on the appropriate servers. However, as rerunning YAIM is a complete reconfiguration of a service, we decided to look into the exact alteration required, to ensure that the change had a low impact.

As of June 2011, using a glite installation, the information that is published through the site bdii is stored in the /opt/glite/etc/gip/ldif directory on each server (this would be different using an EMI installation). The exact files that are in that directory depend on the type of service that is publishing, but in this case we're interested in the glite-info-site.ldif file which is on the site bdii itself. We have (or had) 3 entries mentioning EGEE:

GlueSiteOtherInfo: EGEE_ROC=UK/I
GlueSiteOtherInfo: EGEE_SERVICE=prod
GlueSiteOtherInfo: GRID=EGEE

Of these, we have updated

GlueSiteOtherInfo: GRID=EGEE

to

GlueSiteOtherInfo: GRID=EGI

and restarted the site bdii. After a small wait for the update to appear, we are now appropriately tagged as belonging to EGI as opposed to EGEE. Discussions are now underway as to the appropriate values for the other two variables.

In the site-info.def file itself (which should be updated to make sure that a future run of YAIM on the site BDII does not reverse this change) the corresponding change in our case is:

SITE_OTHER_GRID="EGEE|WLCG|SCOTGRID|GRIDPP"

to

SITE_OTHER_GRID="EGI|WLCG|SCOTGRID|GRIDPP"
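
For anyone scripting the LDIF edit rather than doing it by hand, a minimal Python sketch of the substitution is given below. The function name is made up, and it works on a string rather than on the real glite-info-site.ldif, so reading and writing the file (and restarting the site bdii) is left out.

```python
def retag_grid(ldif_text, old="EGEE", new="EGI"):
    """Rewrite only the 'GlueSiteOtherInfo: GRID=<old>' line."""
    target = f"GlueSiteOtherInfo: GRID={old}"
    replacement = f"GlueSiteOtherInfo: GRID={new}"
    out = []
    for line in ldif_text.splitlines():
        out.append(replacement if line.strip() == target else line)
    return "\n".join(out)


sample = """GlueSiteOtherInfo: EGEE_ROC=UK/I
GlueSiteOtherInfo: EGEE_SERVICE=prod
GlueSiteOtherInfo: GRID=EGEE"""

print(retag_grid(sample))
```

Matching the whole line keeps the EGEE_ROC and EGEE_SERVICE entries untouched, which is what we want while their values are still under discussion.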

For more information see https://wiki.egi.eu/wiki/MAN01

by David Crooks (noreply@blogger.com) at June 01, 2011 01:07 PM

May 20, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

BDII again

A couple of weeks ago I upgraded the site BDII and top BDII from a very old version without reinstalling, as described in this post. A few days ago I noticed that not all was working as well as I thought: the BDII was reporting stale numbers in the dynamic attributes, causing a few problems, among them biomed submitting an unhealthy 12k jobs.

There were two reasons for this:

1) The unprivileged user that runs the BDII is no longer edguser but ldap. Consequently there were some ownership issues in /opt/glite/var subdirectories and files. This was highlighted in /var/log/bdii/bdii-update.log by permission denied errors which I overlooked for a bit too long. Permissions should be as follows: /opt/glite/var, /opt/glite/var/lock, /opt/glite/var/tmp and /opt/glite/var/cache should belong to root and anything below them should belong to ldap. You can check if there is anything that doesn't belong to ldap by running

find /opt/glite/var/ ! -user ldap -ls


this will include the top directories above which you can ignore.

2) bdii-update no longer uses glite-info-wrapper and glite-info-generic, which used to write the .ldif files in the same directory tree above. It now writes what it needs in the /var/run/bdii databases and a single new.ldif file, calling directly the scripts in /opt/glite/etc/gip/provider and /opt/glite/etc/gip/plugin. I upgraded from an older version and the old providers weren't deleted but continued to be executed by bdii-update. Some of them still read what are now obsolete .ldif files under the /opt/glite/var/cache tree. I deleted all the .ldif files with an additional numeric extension under /opt/glite/var.

With these two changes, i.e. fixing the ownership of the directories and deleting the obsolete .ldif files (or the old providers, if one is sure which ones), the site bdii started to update the dynamic attributes correctly again.

Finally a note on making it easier to reinstall: in the previous post I suggested manually adding SLAPD=/usr/sbin/slapd2.4 to /opt/bdii/etc/bdii.conf to change the slapd version to the newly installed one. However, an easier way to maintain the service in case it needs reinstallation is to add SLAPD=/usr/sbin/slapd2.4 to site-info.def, so that when YAIM runs it gets added to /etc/sysconfig/bdii and no manual step is needed if the machine is reinstalled.

by Alessandra Forti (noreply@blogger.com) at May 20, 2011 09:05 AM

BDII follow up

To decrease the need to restart the BDII, and following the discussion on tb-support, I decided to upgrade to openldap2.4. Since I was at it I also updated both glite-BDII_site and glite-BDII_top (the list of new rpms is below) to the latest repositories division, since we still had the older common glite-BDII repo. The newest version of BDII also has new paths for most things. For example some config files have been moved to /etc/bdii and /var/run/bdii is the new SLAPD_VAR_DIR. The set-up of the repos is peculiar to Manchester, where we mirror a latest version every day but the machines pick up from a stable repository that is updated when needed.

1) rsync glite-BDII_site and glite-BDII_top from Glite-3.2-latest to Glite-3.2 stable

2) Added the rpm to the local external repository from the BDII_top RPMS.external dir so it can be picked up also by BDII_site and if the case also CEs and SE.

3) Create new repo files and added them to cvs

4) Edited cf.yaim-repos to copy them

5) Installed manually (yum install) the rpms openldap2.4 and openldap2.4-servers and their dependencies lib64ldap2.4 and openldap2.4-extra-schemas on BDII_site. In the glite-BDII_top case they are pulled in as dependencies so there is no need for this.
# This step can be added in cfengine at a later stage if needed.

6) mv /opt/bdii/etc/bdii.conf.rpmnew /opt/bdii/etc/bdii.conf
# Contains the pointer to the new bdii-slapd.conf which contains the new paths. bdii/slapd won't restart with the old bdii.conf.

7) Add SLAPD=/usr/sbin/slapd2.4 to the new /opt/bdii/etc/bdii.conf
# This can go in yaim post function if one really wants.

8) Rerun YAIM

9) Reduced the rate the cron job checks the bdii from 5 to 20 mins. Top bdii seemed to take longer to rebuild probably due to an expired cache causing a loop.

Crossing fingers it will work and stop the BDII periodically hanging.

New Site BDII RPMS

bdii-5.1.22-1
bdii-config-site-0.9.1-1
glite-BDII_site-3.2.11-1.sl5
glite-yaim-bdii-4.1.12-1

New Top BDII RPMS

bdii-5.1.22-1
bdii-config-top-0.0.9-1
glite-BDII_top-3.2.11-1.sl5
glite-yaim-bdii-4.1.12-1

Openldap2.4 RPMS

lib64ldap2.4_2-2.4.22-1.el5
openldap2.4-2.4.22-1.el5
openldap2.4-extra-schemas-1.3-10.el5
openldap2.4-servers-2.4.22-1.el5

UPDATE 20/

by Alessandra Forti (noreply@blogger.com) at May 20, 2011 08:01 AM

May 06, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Arc and lcmaps

Last time I was talking about Arc, I mentioned that there was an issue with LCMAPS, relating to the bitness of the available libraries.

And that once a 64 bit LCMAPS library was available, that'd be it.

Well, as you might have inferred from a very slight delay, there's just a teensy bit more to it than that.

64 bit libraries are now commonplace, and did indeed resolve the problem we had. However, they just turned up more problems.

Cue one long, and rather frustrating search down the rabbit hole of shared library dependencies. The root problem was that nothing was defining a symbol 'getMajorVersionNumber()', or the minor or patch number versions. Finding what _should_ be doing that, and what those values ought to be, was the tricky part. Perhaps that's more a symptom of my not having spent very much time debugging shared library issues, rather than a sign of a genuinely hard problem.

In the end, it's a known problem with the VOMS libraries, and it's not hard to correct for it in the small scale, by adding stub methods that return 0 in the application code, and compiling with -rdynamic.

However, translating that into something that works for ARC is non-trivial. Recompiling all of AREX to export functions to shared libraries is asking for trouble, given the size of the thing. It's also debatable whether it's the right thing to do to work around what's really a bug in the libraries themselves.

Fortunately, there is another option. Arc can call plugins to do pool account mapping, and these are small external programs. So writing a short wrapper around LCMAPS is straightforward, and then Arc delegates responsibility to this plugin, which is a nice, self-contained place to have the workarounds.

My version of such a plugin is here, and should be identified in the arc.conf as
unixgroup=mapplugin 5 arc-lcmap %D %P

This now lets us use the same pool account mapping and authorisation infrastructure with both gLite and Arc. In particular, this lets us open up the Arc CE to any of our normally supported VOs, as an option for them to explore. That's a topic I'll be working with some VOs on over the summer.

For the moment though, I need to dismantle the layer of auth systems hacks we were using for Arc.

by Stuart Purdie (noreply@blogger.com) at May 06, 2011 03:06 PM

May 04, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

Check BDII script updated

Yesterday the top BDII stopped working rather than the site BDII. It crashed. The pid file was still there but the process was not running.

So I adjusted the script to use a different query that works on all levels of bdii (resource, site, top) looking for o=infosys rather than o=grid and some specific attribute.

I also looked at the bdii startup script and it does a good job of cleaning up processes and lock/pid files in the stop function, so I just use service bdii restart whether the process is there or not; only the alert differs between the two cases.

New version is still in

http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh

by Alessandra Forti (noreply@blogger.com) at May 04, 2011 11:41 AM

April 12, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Scotgrid goes East




We are currently attending the EGI User Forum in Vilnius, where we will be presenting on the Earth Sciences work being conducted at the Glasgow site. There is also a blog of the various events going on in the conference here.

The main themes are around virtualisation, software deployment and most importantly the user community interaction with the Grid.

by Mark Mitchell (noreply@blogger.com) at April 12, 2011 08:31 AM

April 08, 2011

RAL-LCG2, Oxford, UK

Record number of jobs running @ RAL

This week we deployed approx 100 new batch workers, each with 12 cores. Today I have seen the greatest number of jobs ever running on the RAL tier1  batch farm – 5811.

There are a few batch workers unavailable and there are some high memory jobs running but I reckon we could achieve 6000 running jobs in ideal circumstances.  So 6000 running jobs is my new goal for the tier1.

by johnkelly at April 08, 2011 02:34 PM

April 06, 2011

UKI-NORTHGRID-LIV-HEP, Liverpool, UK

Sharing scripts

In my Northgrid talk at GridPP I pointed out that we all do the same things but in slightly different ways, so I thought it'd be good to resume the thread on sharing management/monitoring tools. I always thought building a repository was a good thing and I still do.

I think the tools should be as generic as possible but do not need to be perfect. Of course if scripts work out of the box it's a bonus but they might be useful also to improve local tools with additional checks one might not have thought about.

I'll start with a couple of scripts I rewrote last Monday to make them more robust:

-- Check the BDII:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh

The original script checked whether a network connection existed and, if it didn't, restarted the bdii service.

The new version checks that slapd is responsive. If it isn't, it checks whether there is a hung process: if there is, it kills it and restarts the bdii; if there isn't, it just restarts the bdii.
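
The decision logic of the new check can be written down as a tiny Python function. This is a sketch of the flow described above, not the contents of testbdii.sh; the function name and action strings are made up for illustration.

```python
def plan_actions(slapd_responsive, hung_process_found):
    """Return the ordered actions for the check described above."""
    if slapd_responsive:
        return []                                   # nothing to do
    if hung_process_found:
        return ["kill hung process", "restart bdii"]
    return ["restart bdii"]


for responsive, hung in [(True, False), (False, True), (False, False)]:
    print(responsive, hung, plan_actions(responsive, hung))
```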

-- Check Host Certificate End Date:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/check-host-cert-date.sh

The old version just checked whether the certificate had expired and sent an alert. Not very useful in itself, as it picks up the problem when the damage is already done.

The new version still checks that, because it might be useful if machines have been down for a while, but it also starts sending alerts 30 days before the expiration date. Finally, if the certificate is not there it asks the obvious question: should you be running this script on this machine?
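
The three cases handled by the certificate check (missing, already expired, expiring within 30 days) can be sketched in Python as below. It illustrates the logic only, not the shell script in the repository; the function name and return strings are hypothetical, and the expiry date is passed in rather than read from a real certificate.

```python
from datetime import datetime, timedelta

WARN_DAYS = 30   # start alerting this many days before expiry (from the text)


def cert_status(not_after, now):
    """Classify a certificate; `not_after` is None when no certificate exists."""
    if not_after is None:
        return "missing"          # should this script run on this machine?
    if not_after <= now:
        return "expired"
    if not_after - now <= timedelta(days=WARN_DAYS):
        return "expiring soon"
    return "ok"


now = datetime(2011, 4, 6)
print(cert_status(None, now))
print(cert_status(datetime(2011, 4, 1), now))
print(cert_status(datetime(2011, 4, 20), now))
print(cert_status(datetime(2012, 4, 1), now))
```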

by Alessandra Forti (noreply@blogger.com) at April 06, 2011 11:43 AM

March 29, 2011

RAL-LCG2, Oxford, UK

Whole Node jobs – Implementation and aftermath

In my two previous posts I’ve documented the investigations and the behind the scenes manoeuvring we’d done to allow us to implement whole node jobs. Now we were finally ready to roll out the configuration.

The changes to the batch system to set up the new queue with the new resource requirements went in smoothly. We had no issues with our CREAM CEs, but the lcg-CE gave us a problem: the lcgpbs job manager was adding a requirement of just 1 node, and this requirement was overriding the default requirement set in the queue. We’d missed this in testing, as we had no test lcg-CE, only CREAM CEs. We managed to fix the issue by editing the submit filter to add the 8 processors per node requirement for jobs submitted to the whole node queue. Astute readers may remember this was something I’d hoped to avoid, but I tolerated this hack, as it was only for the lcg-CE, which we are phasing out in favour of the CREAM CE.
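As a rough illustration of the kind of submit filter edit described above, here is a Python sketch. It is not the filter used at RAL. It assumes the generated job script names its queue in a "#PBS -q" line and carries its node request in a "#PBS -l nodes=..." line, and it borrows the queue name gridWN from the maui examples in the earlier post; all of these are assumptions made for the example.

```python
WHOLE_NODE_QUEUES = {"gridWN"}  # assumed queue name, for illustration only


def filter_script(lines, ppn=8):
    """Return the job script lines, adding ppn to node requests for whole node queues."""
    # First pass: find out which queue the script asks for.
    queues = set()
    for line in lines:
        parts = line.split()
        if parts[:2] == ["#PBS", "-q"] and len(parts) > 2:
            queues.add(parts[2])
    if not queues & WHOLE_NODE_QUEUES:
        return list(lines)          # not a whole node job: leave it untouched

    # Second pass: extend any nodes request that has no ppn yet.
    out = []
    for line in lines:
        parts = line.split()
        if (parts[:2] == ["#PBS", "-l"] and len(parts) > 2
                and parts[2].startswith("nodes=") and "ppn=" not in parts[2]):
            line = "#PBS -l {}:ppn={}".format(parts[2], ppn)
        out.append(line)
    return out
```

A real filter would read the script from standard input and print the result; the sketch only covers the rewriting step, so that it can be checked in isolation.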

Other than that small issue, things looked good, so I moved on to testing. We’d reserved 4 nodes for the whole node jobs, but set a limit of 5 jobs, to see how feasible it was to schedule whole node jobs ‘in the wild’ alongside the standard experiment jobs. I then loaded the queue up with a small number of dteam jobs to see what would happen.

What we saw was a “suspiciously coincident” drain of running jobs when jobs were queued in the whole node queue. This can be seen in the image to the left, which shows running jobs over the whole farm, with the number of jobs in the whole node queue for the same time superimposed on top: when there are queueing jobs in the whole node queue (the paler green above the black line) there is a corresponding decrease in the jobs running across the entire farm.

Obviously, we had to understand why this was happening before we released the queue to the experiments, so we postponed publishing the queue until the issue was understood.

After some testing, we had a plausible explanation. Our job priorities were calculated purely from fairshares, with no other factors; as dteam ran few jobs, they were coming out of the fairshare calculation with the highest priority. By default, Maui only makes a priority allocation for the highest priority job; other jobs can only be scheduled to run if they can be used as backfill, finishing before the highest priority job could start. This had never been a significant issue until now, as to date all jobs had been single core jobs and thus able to run on any free batch slot, but the whole node jobs would not necessarily be able to run immediately. This meant that the other jobs could only be used as backfill, but the only information we had on job lengths was the default lengths set on the queues, which were fairly uniform across all queues. Maui therefore would not start the other jobs, as it believed that they would not finish before the whole node job could have started. This led to a farm drain until the whole node job was running (or deleted by a panicking admin wondering what on earth was going on).

Now that we had an explanation for the problem, we had to decide how to fix it. One solution would have been to revise how our priorities are calculated; however, finding a reasonable solution would have been a somewhat involved task. Instead we chose to increase the RESERVATIONDEPTH configuration setting to 2. This is the setting that controls how many jobs get this priority allocation. We combined this with a MAXIJOBS setting of 1 for the whole node queue. This means that only 1 job from the whole node queue can be considered to run during a single scheduler iteration, and thus that the second priority reservation slot is always available to run single core jobs.
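To make the interaction between the two settings easier to follow, here is a deliberately simplified toy model in Python. It is not Maui's algorithm: it only encodes the explanation given above (the top jobs hold reservations, everything else must fit as backfill), and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    whole_node: bool
    walltime_h: float  # wall time the scheduler assumes for the job


def start_jobs(queue, free_slots, hours_until_node_free,
               reservation_depth=1, max_idle_whole_node=None):
    """Toy model of one scheduler iteration; returns names of started jobs.

    queue is sorted by priority, highest first. No node is completely
    empty, so whole node jobs can only hold a reservation and wait.
    """
    started, reservations, whole_node_seen = [], 0, 0
    node_reserved = False
    for job in queue:
        if job.whole_node:
            whole_node_seen += 1
            if (max_idle_whole_node is not None
                    and whole_node_seen > max_idle_whole_node):
                continue                    # not considered this iteration
        has_reservation = reservations < reservation_depth
        if has_reservation:
            reservations += 1
        if job.whole_node:
            node_reserved = node_reserved or has_reservation
            continue                        # waits for a node to drain
        if free_slots == 0:
            continue
        fits_backfill = job.walltime_h <= hours_until_node_free
        if has_reservation or not node_reserved or fits_backfill:
            started.append(job.name)
            free_slots -= 1
    return started
```

With a depth of 1 and uniform 72 hour default lengths, nothing starts behind a waiting whole node job. With a depth of 2 but no limit on the whole node queue, two whole node jobs can take both reservations. Only with both settings does the second reservation reach a single core job. In this toy only that one job starts per iteration, which is a simplification.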

After we’d done this, all that was left was to publish the queue. To avoid normal jobs ending up in the queue we had decided to publish the queue with the GlueCEStateStatus set to Special, but even with this set we still had some monitoring jobs landing in the whole node queue, so we changed it to WholeNode, which seems to work better.

With that, I was able to let the VOs know that our whole node queue was ready for business.

by Derek Ross at March 29, 2011 03:26 PM

March 23, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

ScotGrid Reloaded

As it is spring, we have decided to revamp the blog.
We will be updating the blog over the next couple of weeks and tinkering with the layout.
Please Stand By.

by Mark Mitchell (noreply@blogger.com) at March 23, 2011 01:12 PM

Spanning Tree, oh Spanning Tree

Following last week's power outages we were encountering issues with Spanning Tree reconvergence on our older switching equipment. Since the second power outage, the Nortel 5510 and 5530 switches, which have been stalwarts of the Glasgow cluster install, had been experiencing a major rise in the number of BPDUs being transmitted, as well as an increase in the number of dropped packets across all interfaces. The causes of these two issues are partially inter-related. The switches had suffered a partial loss of configuration on the second power outage, which caused several services, including their NTP client and Spanning Tree, to behave erratically. To resolve the Spanning Tree issue, the configuration was returned to the defaults for the protocol on the Nortel switches. This is shown below:

Hello Time:                 2 seconds
Maximum Age Time:           20 seconds
Forward Delay:              15 seconds
Bridge Hello Time:          2 seconds
Bridge Maximum Age Time:    20 seconds
Bridge Forward Delay:       15 seconds


This stabilised the switches within the older Cluster and reduced the volume of BPDUs that were being sent to the core switch.
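As a quick sanity check on the timer values listed above, the sketch below tests them against the relationship that IEEE 802.1D requires between the three timers. The relationship comes from the standard rather than from this post, and the function name is hypothetical.

```python
def stp_timers_consistent(hello, max_age, forward_delay):
    """Check the IEEE 802.1D relationship between the timers (in seconds).

    2 * (forward_delay - 1) >= max_age >= 2 * (hello + 1)
    """
    return 2 * (forward_delay - 1) >= max_age >= 2 * (hello + 1)
```

The defaults quoted above (hello 2, maximum age 20, forward delay 15) satisfy it, since 28 >= 20 >= 6.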

An overview of the Spanning Tree Protocol is available here: http://en.wikipedia.org/wiki/Spanning_Tree_Protocol

The second issue, dropped packets and pause frames, was again related to the power outage, which appears to have left several dozen worker nodes with problems communicating across the switch environment. The situation improved once the nodes were off-lined and then rebooted after the network reset.

We are still monitoring the situation and will report on any other action taken if required.

by Mark Mitchell (noreply@blogger.com) at March 23, 2011 01:05 PM

March 21, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

Power Issues Redux

On the 15th of March we encountered two power outages within the Campus supply at Glasgow University. We had to put ourselves into downtime and remove ourselves from ATLAS production to effect a recovery from these power cuts. While the UPS infrastructure held up, we thought it prudent not to expose our user community to potential disruption.
The root cause of these outages has now been repaired and we came out of downtime on Thursday the 17th of March.

by Mark Mitchell (noreply@blogger.com) at March 21, 2011 04:46 PM

March 15, 2011

RAL-LCG2, Oxford, UK

Whole node jobs – Towards implementation

At the conclusion of my previous post we’d reached the point where we had a set of command line parameters for qsub that would allocate an entire node to a job, and show up in the output of torque as using multiple CPUs.

While this was useful for testing, I was keen to push as much of this configuration as possible into torque and maui themselves to avoid the problem of having to ensure that everywhere that needed to submit whole node jobs had the correct configuration.

As this was to be a new class of jobs that we were supporting, it made sense to create a new queue specifically for running these jobs. This would give us enough of a hook in maui and torque to hang the appropriate configuration on to only affect the whole node jobs. But what was the appropriate configuration?

We had two items that we needed to translate from the command line parameters into configuration directives:

-W x=NACCESSPOLICY:SINGLEJOB

and

-l nodes=1:ppn=8

Close reading of the maui documentation led me to the conclusion that the first item could be translated into a flag directive for maui, which could be applied to a class. A class corresponds directly to a queue in torque, making the configuration directive for maui:

CLASSCFG[gridWN] FLAGS=DEDICATED

Except it didn’t work.
After much searching I eventually discovered a mailing list post that said that FLAGS should be JOBFLAGS and with much delight I hastily updated my test maui instance to

CLASSCFG[gridWN] JOBFLAGS=DEDICATED

And it still didn’t work.
Further testing showed that JOBFLAGS was indeed correct; the problem this time lay with DEDICATED. Close to defeat, I downloaded the source code for maui and began searching through it. Finally I found a reference to DEDICATEDNODE. With much trepidation I reconfigured maui like so:

CLASSCFG[gridWN] JOBFLAGS=DEDICATEDNODE

Success! The output of diagnose -c showed the flag correctly set.

With this now done, I moved to the second item. This would be in some ways simpler and in others more complicated to implement. It was simpler in that we were already making use of the nodes directive in torque, so didn’t have to search for the correct syntax, but this was also the source of the complexity, as we would have to find some way to integrate the two use cases.

Our original use of nodes in torque was to select the OS that a job was requesting; this was a remnant from our migration from SL4 batch workers to SL5. Each node was configured on the torque server with a property of sl4 or sl5, and then each CE, via torque’s submit filter feature, would add a requirement of -l nodes=1:sl4 or -l nodes=1:sl5 as appropriate to the submit script generated by the job manager. We felt this was a useful trick to keep around: while OS migrations are not something we do often, they are quite involved, and this did make them easier.

However this led to a problem for the whole node solution – the nodes parameter set in the submit script overrode any default specified in the queue, thus while setting

resources_default.nodes = 1:ppn=8

on the queue was possible, it would be disregarded. One solution would be to edit the submit filter script to set a nodes line like

-l nodes=1:sl5:ppn=8

when submitting jobs to a whole node queue, but this wasn’t preferred as it would lead to the CEs having to “know” which queues were whole node queues. So I consulted the torque manual again, and discovered that torque had a built in resource of opsys which we could use*. Currently the opsys of all our worker nodes was unset in the mom configuration, so was defaulting to ‘Linux’. I tweaked some settings in quattor to set the opsys in the MOM configuration in each worker node to the appropriate value, and then could update the torque configuration and submit filter scripts to use opsys instead of nodes on our existing queues, leaving the nodes parameter free for use by the new whole node queue.

After these preparatory interventions, the stage was now set for the rolling out of the whole node configuration itself…

* As to why we hadn’t used opsys for the OS migration originally, I’m not entirely sure – I know we had been using the node properties for bigmem queues and it may be that we just stuck to what we knew when implementing the OS migration functionality.

by Derek Ross at March 15, 2011 05:00 PM

March 02, 2011

RAL-LCG2, Oxford, UK

Whole Node jobs – Initial investigations

For the past few months, after a request from several VOs, we’ve been working on adding the ability to run whole-node jobs to our batch farm. These are jobs which are assigned a node entirely dedicated to their use.

The first task was to investigate the possible methods of implementing whole node scheduling.
An obvious way would have been to set up a dedicated batch system and CEs for whole nodes, with the worker nodes configured with one job slot per node, rather than the one job slot per core as we have now; but this would have led to a large administrative overhead of moving worker nodes between it and our original batch system to meet fluctuating demand.

Having decided against that approach, we decided to investigate whether maui and torque could be configured to support whole node jobs alongside our existing configuration on the same batch server. This would eliminate the administrative overhead of moving nodes between separate batch systems, at the cost of making our existing batch system configuration slightly more complicated.

Work split in two lines of investigation at this point. The first was investigating whether maui could be told to allocate an entire node for a job. The second was investigating whether torque’s MPI support could be used to allocate a whole node.

James Adams worked on the first line of inquiry and quickly discovered that maui could be instructed to allocate an entire node for a job, on a job by job basis, by adding the parameter:

-W x=NACCESSPOLICY:SINGLEJOB

to a qsub command line. This is a Resource Manager extension which torque passes through to maui, requesting that this job be the only job scheduled to a node. The only downside was that this was known purely to maui: torque was unaware that the job had been allocated the whole node and showed the other cores on the host as unused.

Concurrently with this I was investigating using the

-l nodes=8

command line parameter to qsub to get 8 job slots. However, I encountered an odd issue when running jobs on a small test farm – if the number of hosts was smaller than the number of nodes requested on the command line, then the job would fail to be submitted, even if one of those nodes could have run the job. It turns out that torque is inconsistent in its interpretation of ‘node’ – sometimes it means job slot, other times the whole host.
Switching to:

-l nodes=1:ppn=8

avoided the issue. Using the processors per node directive sidesteps torque’s interpretation of node, but leaves a problem: the number of cores per host is steadily increasing. Our new tranche of worker nodes in testing has 12 cores per host. When these nodes are put into the batch farm, they will be able to run jobs submitted with the ppn=8 parameter, but will not be exclusive to those jobs – the other 4 job slots will still be available for scheduling by maui.

Comparing notes, we realised that if we combined our two approaches, then we would get something that would be almost perfect:

-l nodes=1:ppn=8 -W x=NACCESSPOLICY:SINGLEJOB

This instructs maui to allocate the whole node, and also informs torque that 8 processors on this node are in use – not completely perfect, but certainly better than torque believing that only 1 processor was in use, which could have led to some alarming apparent drops in the number of cpus in use when we rolled it out to production.
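For anyone scripting test submissions, the combined options can be assembled programmatically. The Python sketch below only builds the argument list from the two parameters shown above; the helper name and the script name job.sh are hypothetical, and nothing is actually submitted.

```python
import shlex


def whole_node_qsub_args(script, ppn=8):
    """Build a qsub argument list combining the two options discussed above."""
    return [
        "qsub",
        "-l", "nodes=1:ppn={}".format(ppn),   # tell torque how many cores are used
        "-W", "x=NACCESSPOLICY:SINGLEJOB",    # ask maui for exclusive use of the node
        script,
    ]


def as_command_line(args):
    """Render the argument list as a shell command for logging."""
    return shlex.join(args)
```

Passing the list straight to a process launcher avoids shell quoting problems; the rendered string is only meant for logs.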

by Derek Ross at March 02, 2011 10:52 AM

UKI-SCOTGRID-DURHAM, Durham, UK

Wide Area Wonder

After several months of investigating asymmetric traffic flows from Glasgow to RAL, we finally appear to have resolved the issue. Working with internal Computing Services staff at the University of Glasgow and GridPP staff at RAL, we are now seeing sustained simultaneous transfer speeds of around 2.3 Gig a second inbound and outbound.

The commands run for tests are shown below:

iperf -s -u -p 5001 -w 2M (server command, run on the receiving host)
iperf  -d -u -p 5001 -t 600 -w 1M -c hostname -b 700M -i 30 (client command, run on the sending host)

Associated network interface card and CPU loads on one of the devices on which the tests were run.




Effectively, the Glasgow site is now an extension of the Clydenet to JANET infrastructure in the west of Scotland and we will be monitoring the services over the next month to ensure that this network solution is as stable and reliable as the previous interconnection.

In addition to this work, over the next 3 months we will be investigating in Glasgow the optimisation of the Layer 2 to Layer 3 network infrastructure between ourselves, the University and the rest of GridPP.

by Mark Mitchell (noreply@blogger.com) at March 02, 2011 10:50 AM

February 21, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

The CE is dead. Long live the CE. Nos paenitet incommodo

As part of the on-going developments to the ScotGrid cluster at Glasgow, we have decommissioned our final LCG-CE, which resided on SVR021. The removal of this CE allows us to concentrate support and development on two CE platforms: CREAM and ARC. We are planning to conduct a series of tests around the three CREAM CEs we have deployed at Glasgow, in an attempt to gain a better understanding of their maximum loading potential for running jobs and of how to tweak them to gain the maximum efficiency from this service.

Additionally, we will be testing our availability metrics over the next month as the LCG-CE was one of the corner stones of Steve Lloyd's tests of our overall availability. This will now be monitored primarily through our SRM availability.

The reasons for decommissioning the LCG-CE are that we would have been removing it at some point in the near future anyway, that none of the big VOs have issues with submitting to CREAM CEs, and that it simplifies our internal support requirements.

The new servers running Cream are svr008, svr014 and svr026.

Thank you LCG-CE and goodnight.

by Mark Mitchell (noreply@blogger.com) at February 21, 2011 04:21 PM

Covering up problems with CREAM

For some days now, ScotGrid Glasgow has been operating with only CREAM CEs, having turned our final lcg-CE off around the 14th. I'll let Mark cover the details of this in his later post, but I wanted to briefly mention one of the minor configuration details that caused some problems for us initially.

The gridmapdir (usually in /etc/grid-security/gridmapdir ) is a somewhat integral part of the pool account mapping system in LCG/gLite services. It contains one (empty) file for each pool account, plus hard-links to them from each DN(+VOMS Role) mapped to them. Basically, it's a cheap way to ensure that you don't get multiple mapped DNs to the same account (as you can always count the number of hard-links to an inode).
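The hard-link trick is easy to demonstrate. The following Python sketch shows how a link count reveals whether a pool account is already leased; it is not the LCMAPS implementation, and the function names and the encoded DN file name used in testing are hypothetical.

```python
import os


def is_leased(pool_file):
    """True if some DN is hard-linked to this pool account file."""
    return os.stat(pool_file).st_nlink > 1


def lease(gridmapdir, pool_account, encoded_dn):
    """Map a DN to a pool account by hard-linking, refusing double leases."""
    pool_file = os.path.join(gridmapdir, pool_account)
    if is_leased(pool_file):
        raise RuntimeError("{} is already mapped".format(pool_account))
    # Both names now point at the same inode, so the link count becomes 2.
    os.link(pool_file, os.path.join(gridmapdir, encoded_dn))
```

There is a window between the check and the link in this sketch, so a real service would need to verify the link count again after linking.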

We share our gridmapdir, over NFS, to all of our CEs, to ensure that any incoming job from a given user is consistently mapped. Unfortunately, this led to our minor configuration gaffe (which I just fixed).
The lcg-CE, you see, is configured to set the ownership and permissions on the gridmapdir to 0755 root:root. This is fine for it, since lcg-CEs do strange things like running their services with root permissions, and it prevents anything else from messing up the mappings.

CREAM CEs (using glexec) need to have their gridmapdir as 0775 root:glexec, a change which we hadn't made when we installed them (and which YAIM probably couldn't have done for us). This meant that, for as long as the CREAM CEs had been installed, they had never been able to create a new mapping in the gridmapdir, as they try to do that as members of the glexec group.
We never really noticed this problem while we had lcg-CEs which were busy, as the lcg-CE would almost always have also received jobs from the user previously and already performed the mapping.

Now that we don't have an lcg-CE, however, it started to cause some odd problems when we enabled new VOs, as the configuration seemed perfectly fine for the VO itself, but jobs would bounce off the CREAM CEs with "Failed to get the local userid with glexec" errors.
Obviously, this was trivially solved once we worked out what the issue was (by setting the gridmapdir's group-ownership and permissions to glexec g+w), but identifying it was a little tricky, as the default logging level for LCMAPS doesn't give many clues as to what problem it's having.
Turning the debug level up to 3 (in /opt/glite/etc/glexec.conf ) was sufficient to get it to log errors with gridmapdir_newlease(), however, and then, after some poking (and manual creation of DN links to see what happened), the problem became clear.
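A check for this particular misconfiguration is simple to automate. The Python sketch below tests whether a directory's group has the write and search bits needed to create new entries; it only inspects the mode bits and does not verify that the group is actually glexec. The function name is hypothetical.

```python
import os
import stat


def group_can_create(path):
    """True if the directory's group has write and search permission."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    needed = stat.S_IWGRP | stat.S_IXGRP
    return mode & needed == needed
```

A directory at 0755 fails this check and one at 0775 passes, which matches the difference between the lcg-CE and CREAM settings described above.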

So, this is a cautionary tale about moving from a mixed CE environment to a monoculture (ignoring Stuart's ARC installation) - sometimes a misconfiguration in one service can be hidden by the correct functioning of the service you're just about to remove.

by Sam Skipsey (noreply@blogger.com) at February 21, 2011 03:45 PM

January 19, 2011

UKI-SCOTGRID-DURHAM, Durham, UK

My God; it's full of data-transfers!

The Great ATLAS Spacetoken Migration of 2011 kicked off yesterday evening, and with 47TB of data sitting in MCDISK at Glasgow, Brian and I decided to take the opportunity to see how fast we could push it across to DATADISK.
So, since ATLAS Data Management in this case happens over FTS (even though the vast majority of the transfers are internal to a site), we turned up the number of slots for STAR-GLASGOW a bit, from 20 (our default) to 50 (which was fun) and then up to 80 (although we peaked at around 65 used).
With effectively no limit from FTS, our data rates were... impressive. Although it's an unfair comparison (everyone else was limited by FTS, and we were mostly moving things over the internal network), we managed to hit a peak transfer rate of 1.5GB/s internally (yes, that's 12Gbit/s), and sustain around 8Gbit/s. That equated to around two thirds of the total UK data movement over STAR channels, or roughly two thirds of ATLAS's total traffic in this migration. At that rate, none of our disk servers were stressed, and the network switches were intensely relaxed.
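The unit conversion, and a rough lower bound on how long 47TB would take at the sustained rate, can be worked out as below. This Python sketch assumes decimal units (1 TB = 10^12 bytes) and a constant rate, neither of which the post states, so the duration is an estimate rather than a measurement.

```python
def gbytes_to_gbits(gbytes_per_s):
    """Convert a rate in gigabytes per second to gigabits per second."""
    return gbytes_per_s * 8


def transfer_hours(terabytes, gbits_per_s):
    """Hours needed to move a volume at a steady rate (decimal units)."""
    bits = terabytes * 1e12 * 8
    return bits / (gbits_per_s * 1e9) / 3600
```

Under those assumptions, 47TB at a steady 8Gbit/s takes 47000 seconds, or roughly 13 hours.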

Some exciting graphs follow:



by Sam Skipsey (noreply@blogger.com) at January 19, 2011 09:52 AM

January 12, 2011

RAL-LCG2, Oxford, UK

SATA RAID controller experiences at the Tier1

I thought it might be worth recording a summary of our RAID controller experiences.  At the Tier1, we have a variety of RAID controllers in production disk servers:

Controller              Number in production
Adaptec 52445             38
Adaptec 5405              60
Areca ARC-1170            21
Areca ARC-1280            50
3ware 9550SXU-16ML        86
3ware 9550SXU-4LP         86
3ware 9650SE-16ML        182
3ware 9650SE-4LPML       242
3ware 9650SE-24M8         60
TOTAL                    825

Adaptec 52445

These were procured in 2009 and were the first Adaptec SATA RAID controllers purchased by the Tier1. So far, we have not had any serious problems with them.

Adaptec 5405

The alternative procurement in 2009 originally had LSI controllers but we could not get them to pass our acceptance tests; they would drop the whole RAID6 array under load. Neither the vendor nor LSI could trace the problems and the vendor swapped the cards for the Adaptec ones. The replacement cards passed the acceptance tests without problem.

Areca ARC-1170

These are the oldest controllers in production at the Tier1, procured in 2005. We have had very few problems with these and they will be retired in 2011.

Areca ARC-1280

These were installed in one half of the 2008 procurement. They passed acceptance testing OK but from August 2010 onwards, we have had three cases of data loss on these machines. The controller appears to drop the array when a drive fails. We have had drive failures where the drive has been correctly ejected from the array so not all drive failure modes trigger the problem. It appears that some drive failures are not handled correctly by the controller but that is yet to be confirmed.

We have managed to force the problem with the acceptance tests but we have to wait for a drive failure to trigger it. We had to wait 40 days on the machine we were testing. The Tier1 has just finished removing all of the affected batch of servers from CASTOR production and we will start testing on all 50 machines. One machine was taken off site and had the system board and the backplane replaced. It is possible that the problems are not controller related but we need to run the testing to find out. We are in contact with Areca and WD and will be analysing drive and controller logs with them.

3ware 9550SXU-16ML

These were the first 3ware cards at the Tier1, procured in 2006. We have experienced load-related issues with the cards. When the RAID5 array was heavily loaded, it blocked access to the RAID1 system mirror and we could not get responses from nagios checks, SSH, etc. This was solved by adding an extra 3ware 9550SXU-4LP to each machine. The extra card controlled the system mirror, leaving the original card to control the data array.

We have seen these controllers fail to start a rebuild automatically despite this feature being enabled, and, occasionally, they have needed a reboot to see newly inserted drives.

3ware 9550SXU-4LP

These were added as the system array controllers on the 2006 procurement (see above). We have not seen serious problems with these controllers.

3ware 9650SE-16ML

These were procured in 2007 and we have had no real problems with these controllers. Occasionally, they have needed a reboot to see newly inserted drives.

3ware 9650SE-4LPML

These were also procured in 2007 as the controllers for the system arrays after our experience with the previous procurement. We have had no serious problems with these controllers.

3ware 9650SE-24M8

These made up half of the 2008 procurement and the machines that these cards were in failed the acceptance tests. We worked for many months with the vendor to get the systems working. Eventually, we found that replacing the original drives with drives from another manufacturer fixed the problem. We believe the cause was some sort of incompatibility between the controller and the drives, despite the drives being on the 3ware compatibility list.

by James Thorne at January 12, 2011 04:38 PM

December 15, 2010

UKI-SCOTGRID-DURHAM, Durham, UK

Clotted CREAM

Last time I was blogging, I mentioned some problem with our CREAM CE, and too many jobs in the Blah Registry.

Unlike my initial theory, the all_done interval problem turned out to not be the culprit; instead it was down to the Blah Registry.

CREAM splits the whole deal with being a Compute Element into two main parts: the interaction with the wider world, which is handled with some Java code using Tomcat; and the direct interaction with the batch system, called BLAH, and written in C and shell script.

The Java code, which I'll refer to as CREAM, as distinct from the BLAH parts, keeps its state in the MySQL database. BLAH, on the other hand, uses a hand rolled indexed file, with C functions for accessing and writing data.

The BLAH registry is updated by the command blah_job_registry_add after the qsub is complete, to record the mapping between the CREAM job ID and the batch system job ID. This is the step where we ran into problems. The version of CREAM we were running was set to purge jobs after about two months - and in two months we were putting just over half a million jobs through it.

With that many jobs in the registry, it was taking a noticeable time to add any job. Further, the locking done effectively serialises access to the registry (i.e. Table locking in RDBMS parlance). Couple that with the Atlas pilot factory's favourite habit of dumping jobs in batches of 10 to 20 at a time, and you can see how some jobs ended up taking longer than the timeout to register.
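The effect of serialised registry access on a batch of submissions can be estimated with simple arithmetic. The Python sketch below is a back-of-the-envelope model only: it assumes every add takes the same time and that jobs in a batch queue up one behind another. The timeout value used when trying it out is invented for illustration, as the real timeout is not given here.

```python
def worst_case_wait(batch_size, seconds_per_add):
    """Wait seen by the last job in a batch when registry adds are serialised."""
    return batch_size * seconds_per_add


def jobs_timing_out(batch_size, seconds_per_add, timeout):
    """Number of jobs in the batch whose registration finishes after the timeout."""
    return sum(1 for position in range(1, batch_size + 1)
               if position * seconds_per_add > timeout)
```

For example, a batch of 20 jobs at 2 seconds per add leaves the last job waiting 40 seconds; with a hypothetical 30 second timeout, the final 5 jobs in that batch would miss it.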

Just before we'd encountered this, there was a new version of CREAM released (glite-CREAM-3.2.8) that cut the default time before purging to about one month, and put the indices in a mmapped file; both should mitigate this problem. We limped along with some workarounds for a bit [0], before doing that update earlier this week. The update from 3.2.7 to 3.2.8 went very quickly, by the way: it took us about 5 minutes, although we did have to manually tidy up /etc/sudoers.

As it stands now, with about a quarter of a million jobs in the registry, it's taking a couple of seconds to register a job, with occasional pauses when there are many jobs pending. Thus far it's prevented a recurrence of a large number of blocked jobs, but I'll be keeping an eye on it.


[0] The other CEs were having hardware issues, and we didn't want to have all the CEs down at once...

by Stuart Purdie (noreply@blogger.com) at December 15, 2010 04:54 PM

RAL-LCG2, Oxford, UK

Migrating CMS from T10KA to T10KB Tape Media

One of the challenges of running the tape system is managing the transparent migration between media types as new hardware becomes available. Earlier this year we migrated CMS from the 500GB T10KA tape drives to the new double density (1TB) T10KB drives. Both the T10KA and the T10KB share the same media but we chose to repack the old data written by the T10KA drives onto T10KB density media in order to reclaim tapes.

The exercise started on the 9th July and was finished on the 27th October. The CMS tape pools were moved in no particular order, but one pool at a time. For full A tapes the system averaged 30 tapes a day migrated. The system was automated within a tape pool, but manual intervention was required to change to the next tape pool. Any tapes that had caused a problem with the repack were investigated (i.e. hardware problems that stopped the data being read, or problems with the format of the data on the tape).

The plot below shows the drop in the number of tapes to be migrated. The plotting started shortly after the migration started.

At the start CMS had 1.5PB of data, about 3,300 tapes in 47 tape pools.  Once a tape had been emptied, the tape was moved into a holding pool so the tape was not re-used in case of problems and we had to go back to the original tape to access the data again.
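The figures quoted above are consistent with each other, which a few lines of Python can confirm. The year 2010 is assumed from the date of this post, and decimal units (1 PB = 10^15 bytes) are assumed for the per-tape average; the function names are hypothetical.

```python
from datetime import date


def days_needed(tapes, tapes_per_day):
    """Days to migrate a number of tapes at a steady daily rate."""
    return tapes / tapes_per_day


def average_gb_per_tape(petabytes, tapes):
    """Average amount of data per tape in GB (decimal units)."""
    return petabytes * 1e15 / tapes / 1e9


# Calendar days between the quoted start and end dates (year assumed).
elapsed_days = (date(2010, 10, 27) - date(2010, 7, 9)).days
```

About 3,300 tapes at 30 a day is 110 days, which matches the 110 days between the 9th July and the 27th October, and the average of roughly 455GB per tape fits 500GB media that were mostly full.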

Below are ganglia plots which show the steady state running

The hardware involved in this migration was 3-4 disk servers (old units aged out of production use); 3-4 T10KA tape drives dedicated to reading; and 2 T10KB drives dedicated to writing. We tried adding extra disk servers to the system, but this had the effect of lowering the aggregate throughput. As the A and B drives are the same speed, having n+1 A drives was sufficient to keep the n B drives supplied with files to migrate.

There were 8 problem tapes:

  • One tape snapped, caused by a drive failure.
  • One tape has a media defect after file 8.
  • Six tapes had a mismatch between what the nameserver believed was on the tape and what was actually on the tape. These inconsistencies occurred in 2008 and were caused by operational problems with an earlier release of CASTOR. We believe only CMS were affected.

Next year we expect to carry out a migration of the remaining VOs off the T10KA media onto T10KC drives (and new media). For the next migration (A-C) the C drives will be quicker than the A drives, so as a starting point we expect to need 2n+1 A drives to keep n C drives busy, and we will need to use the newest generation of 10Gb enabled disk servers.

Tim Folkes/Andrew Sansum

by Andrew Sansum at December 15, 2010 03:02 PM

December 10, 2010

RAL-LCG2, Oxford, UK

RAL Tier1 – Plans for Christmas Holiday.

RAL closes at 3pm on Friday 24th December and will re-open on Tuesday 4th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems. Some hardware interventions, such as to swap out faulty disks will also take place over this time. However, we have relaxed our expectation that the on-call person will respond within two hours, particularly on 25/26 December and 1st January.

During the holiday we will check for tickets in the usual manner. However, only service critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

http://www.gridpp.rl.ac.uk/status/

Gareth Smith

by Gareth Smith at December 10, 2010 11:14 AM

December 02, 2010

RAL-LCG2, Oxford, UK

Top BDII on the Amazon EC2 Cloud

As a first step towards investigating the viability of hosting services on third-party cloud infrastructure, and as part of the Tier-1’s cloud strategy strand, we recently decided to deploy a Top BDII on the Amazon EC2 cloud, EC2 being one component of the Amazon Web Service (http://aws.amazon.com/). The promise of some free pump-priming resources on the AWS Free Usage Tier (http://aws.amazon.com/free/), announced last month, provided further encouragement.

Why choose a Top BDII? A few reasons make it an ideal candidate: it’s a service that doesn’t require any authentication or authorisation, it has no persistent state information, and it’s also one of the easiest gLite node types to configure.

The AWS was new to me, so I spent a bit of time familiarising myself with the basic concepts of Amazon Machine Images (AMIs), instance types, and generally getting to know my way around the AWS Management Console.

The Free Usage Tier provides use of an Amazon EC2 Micro Instance for up to 750 hours per month (i.e., one instance running continuously). However, it turned out that the Micro Instance doesn’t provide sufficient resources to run a BDII service without running into memory issues (only 613 MB of memory is available). One possibility would be to optimise the memory footprint of the application to match these resources, but we had already dusted off one of our corporate credit cards before the Free Usage Tier was announced, so I decided it would be simpler to move onto a Large Instance (7.5 GB of memory), take the small financial hit, and move onto the standard, pay-as-you-go service rates.

I chose a Basic 64-bit Amazon Linux AMI 1.0 (AMI Id: ami-2272864b; Amazon Linux AMI Base 1.0, EBS boot, 64-bit architecture with Amazon EC2 AMI Tools. Root Device Size: 10 GiB), running on a Large Instance with the default kernel and RAM disk options, and by launching the instance took a step into the unknown.

However, in this case the unknown turned out to be quite familiar territory, and after requesting the connection details, and logging into the instance, I used the standard SL5 repos, installed the glite-BDII_top metapackage and its dependencies, and proceeded to configure the service with YAIM. A few minutes later, with the Site BDIIs all queried and the databases populated with information, I was able to query the ldap service, and had a fully-functional Top BDII up and running.

Since running this instance is not free, I’ve been stopping and starting it as required. The monthly costs of running this service would be about $250 for the Large Instance itself, and about $370 for the network traffic (based on metrics from our production service). There are some other costs associated with block storage, but these are expected to be small, so the overall cost of running the service in this configuration is a little over $600 per month. Other pricing models are available, but these have not yet been investigated.
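The arithmetic behind that estimate can be sketched as follows. The two dollar figures are the ones quoted above; the function name, the 730-hour month, and the assumption that both instance and network costs scale linearly with running time are illustrative only and are not taken from AWS pricing.

```python
# Monthly figures quoted in the post (USD).
INSTANCE_USD_PER_MONTH = 250.0  # Large Instance running continuously
NETWORK_USD_PER_MONTH = 370.0   # network traffic, from production metrics


def estimate_monthly_cost(hours_running: float = 730.0,
                          hours_in_month: float = 730.0,
                          storage_usd: float = 0.0) -> float:
    """Estimate the monthly bill in USD.

    Assumes (hypothetically) that instance and network costs
    scale linearly with the fraction of the month the instance
    is running; block storage is added as a flat amount.
    """
    fraction = hours_running / hours_in_month
    return (INSTANCE_USD_PER_MONTH + NETWORK_USD_PER_MONTH) * fraction + storage_usd


if __name__ == "__main__":
    print(f"Always on:      ${estimate_monthly_cost():.2f}")
    print(f"Half the month: ${estimate_monthly_cost(hours_running=365.0):.2f}")
```

Running continuously, the two quoted figures alone come to $620, consistent with "a little over $600" before the small block storage costs are added.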

So, the Tier-1 is actively increasing its experience of cloud technologies (see Derek’s recent post on StratusLab: http://www.gridpp.rl.ac.uk/blog/2010/11/26/first-steps-with-stratuslab-release-0-1/), and there are plenty more interesting aspects to explore on EC2 such as providing custom AMIs with pre-configured Top BDIIs, integrating with our standard production service configuration (provided by Quattor QWG templates), adding our normal service monitoring and metrics, and so on.

Ultimately, WLCG have other plans for failover of BDII services between regions, and other services may be better candidates for serious consideration of off-site running, but for me at least it’s been an interesting exercise, and it paves the way for other services which will, of course, provide different challenges.

by Matt Hodges at December 02, 2010 01:35 PM

November 29, 2010

UKI-SCOTGRID-DURHAM, Durham, UK

Scotgrid weekend Downtime

Due to an issue with one of the environmental control units relating to our water cooling system, we had to take part of the cluster down over the weekend. The issue has now been identified and rectified. Normal service was resumed this afternoon for the entire cluster.

by Mark Mitchell (noreply@blogger.com) at November 29, 2010 09:14 PM