Change search
Refine search result
1234 51 - 100 of 157
Cite
Citation style
• apa
• harvard1
• ieee
• modern-language-association-8th-edition
• vancouver
• Other style
More styles
Language
• de-DE
• en-GB
• en-US
• fi-FI
• nn-NO
• nn-NB
• sv-SE
• Other locale
More languages
Output format
• html
• text
• asciidoc
• rtf
Rows per page
• 5
• 10
• 20
• 50
• 100
• 250
Sort
• Standard (Relevance)
• Author A-Ö
• Author Ö-A
• Title A-Ö
• Title Ö-A
• Publication type A-Ö
• Publication type Ö-A
• Issued (Oldest first)
• Created (Oldest first)
• Last updated (Oldest first)
• Disputation date (earliest first)
• Disputation date (latest first)
• Standard (Relevance)
• Author A-Ö
• Author Ö-A
• Title A-Ö
• Title Ö-A
• Publication type A-Ö
• Publication type Ö-A
• Issued (Oldest first)
• Created (Oldest first)
• Last updated (Oldest first)
• Disputation date (earliest first)
• Disputation date (latest first)
Select
The maximal number of hits you can export is 250. When you want to export more records please use the Create feeds function.
• 51.
Blekinge Institute of Technology, School of Engineering, Department of Systems and Software Engineering.
Blekinge Institute of Technology, School of Engineering, Department of Mathematics and Natural Sciences. Blekinge Institute of Technology, School of Engineering, Department of Systems and Software Engineering. Blekinge Institute of Technology, School of Engineering, Department of Systems and Software Engineering.
Optimal recovery schemes in fault tolerant distributed computing2005In: Acta Informatica, ISSN 0001-5903, E-ISSN 1432-0525, Vol. 41, no 6, p. 341-365Article in journal (Refereed)

Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down, the load on these computers must be redistributed to other computers in the system. The redistribution is determined by the recovery scheme. The recovery scheme is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, where brute force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has a high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for a maximum of twenty one computers in the distributed system or cluster.

• 52. Klonowska, Kamilla
The maximum gain of increasing the number of preemptions in multiprocessor scheduling2009In: ACTA INFORMATICA, ISSN 0001-5903 , Vol. 46, no 4, p. 285-295Article in journal (Refereed)

We consider the optimal makespan C(P, m, i) of an arbitrary set P of independent jobs scheduled with i preemptions on a multiprocessor with m identical processors. We compare the ratio for such makespans for i and j preemptions, respectively, where i < j. This ratio depends on P, but we are interested in the P that maximizes this ratio, i. e. we calculate a formula for the worst case ratio G(m, i, j) defined as G(m, i, j) = max C(P, m, i)/C(P, m, j), where the maximum is taken over all sets P of independent jobs.

• 53. Klonowska, Kamilla
Using Modulo Golomb Rulers for Optimal Recovery Schemes in Distributed Computing2003Conference paper (Refereed)
• 54. Klonowska, Kamilla
Comparing the optimal performance of parallel architectures2004In: Computer journal, ISSN 0010-4620, E-ISSN 1460-2067, Vol. 47, no 5, p. 527-544Article in journal (Refereed)

Consider a parallel program with n processes and a synchronization granularity z. Consider also two parallel architectures: an SMP with q processors and run-time reallocation of processes to processors, and a distributed system (or cluster) with k processors and no run-time reallocation. There is an inter-processor communication delay of t time units for the system with no run-time reallocation. In this paper we define a function H(n,k,q,t,z) such that the minimum completion time for all programs with n processes and a granularity z is at most H(n,k,q,t,z) times longer using the system with no reallocation and k processors compared to using the system with q processors and run-time reallocation. We assume optimal allocation and scheduling of processes to processors. The function H(n,k,q,t,z)is optimal in the sense that there is at least one program, with n processes and a granularity z, such that the ratio is exactly H(n,k,q,t,z). We also validate our results using measurements on distributed and multiprocessor Sun/Solaris environments. The function H(n,k,q,t,z) provides important insights regarding the performance implications of the fundamental design decision of whether to allow run-time reallocation of processes or not. These insights can be used when doing the proper cost/benefit trade-offs when designing parallel execution platforms.

• 55. Klonowska, Kamilla
Extended Golomb Rulers as the New Recovery Schemes in Distributed Dependable Computing2005Conference paper (Refereed)

Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down the load on these computers must be redistributed to other computers in the cluster. The redistribution is determined by the recovery scheme. The recovery scheme should keep the load as evenly distributed as possible even when the most unfavorable combinations of computers break down, i.e. we want to optimize the worst-case behavior. We have previously defined recovery schemes that are optimal for some limited cases. In this paper we find a new recovery schemes that are based on so called Golomb rulers. They are optimal for a much larger number of cases than the previous results.

• 56. Klonowska, Kamilla
Using Modulo Rulers for Optimal Recovery Schemes in Distributed Computing2004Conference paper (Refereed)

Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down the load on these computers must be redistributed to other computers in the cluster. The redistribution is determined by the recovery scheme. The recovery scheme should keep the load as evenly distributed as possible even when the most unfavorable combinations of computers break down, i.e. we want to optimize the worst-case behavior. We define recovery schemes, which are optimal for a larger number of computers down than in previous results. We also show that the problem of finding optimal recovery schemes for a cluster with n computers corresponds to the mathematical problem of finding the longest sequence of positive integers for which the sum of the sequence and the sums of all subsequences modulo n are unique.

• 57. Kågström, Simon
Automatic Low Overhead Program Instrumentation with the LOPI framework2005Conference paper (Refereed)
• 58. Kågström, Simon
Cibyl: an Environment for Language Diversity on Mobile Devices2007Conference paper (Refereed)

With an estimated installation base of around 1 billion units, the Java J2ME platform is one of the largest development targets available. For mobile devices, J2ME is often the only available environment. For the very large body of software written in C other languages, this means difficult and costly porting to another language to support J2ME devices. This paper presents the Cibyl programming environment which allows existing code written in C and other languages supported by GCC to be recompiled into Java bytecode and run with close to native Java performance on J2ME devices. Cibyl translates compiled MIPS binaries into Java bytecode. In contrast to other approaches, Cibyl supports the full C language, is based on unmodified standard tools, and does not rely on source code conversion. To achieve good performance, Cibyl employs extensions to the MIPS architecture to support low-overhead calls to native Java functionality and use knowledge of the MIPS ABI to avoid computing unused values and transfer unnecessary registers. An evaluation on multiple virtual machines shows that Cibyl achieves performance similar to native Java, with results ranging from a slowdown of around to a speedup of over 9 depending on the JVM and the benchmark.

• 59. Kågström, Simon
Experiences from Implementing Multiprocessor Support for an Industrial Operating System Kernel2005Conference paper (Refereed)
• 60. Kågström, Simon
Optimizations in the Cibyl binary translator for J2ME devices2008Conference paper (Refereed)

The Java J2ME platform is one of the largest software platforms available, and often the only available development platform for mobile phones, which is a problem when porting C or C++ applications. The Cibyl binary translator targets this problem, translating MIPS binaries into Java bytecode to run on J2ME devices. This paper presents the optimization framework used by Cibyl to provide compact and well-performing translated code. Cibyl optimizes expensive multiplications/divisions, floating point support, function co-location to Java methods and provides a peephole optimizer. The paper also evaluates Cibyl performance both in a real-world GPS navigation application where the optimizations increase display update frequency with around 15% and a comparison against native Java and the NestedVM binary translator where we show that Cibyl can provide significant advantages for common code patterns.

• 61. Kågström, Simon
The Design and Implementation of Multiprocessor Support for an Industrial Operating System Kernel2005Report (Other academic)

The ongoing transition from uniprocessor to multiprocessor computers requires support from the operating system kernel. Although many general-purpose multiprocessor operating systems exist, there is a large number of specialized operating systems which require porting in order to work on multiprocessors. In this paper we describe the multiprocessor port of a cluster operating system kernel from a producer of industrial systems. Our initial implementation uses a giant locking scheme that serializes kernel execution. We also employed a method in which CPU-local variables are placed in a special section mapped to per-CPU physical memory pages. The giant lock and CPU-local section allowed us to implement an initial working version with only minor changes to the original code, although the giant lock and kernel-bound applications limit the performance of our multiprocessor port. Finally, we also discuss experiences from the implementation.

• 62. Kågström, Simon
The Design and Implementation of Multiprocessor Support for an Industrial Operating System Kernel2009In: International Journal of Computers and Their Applications, ISSN 1076-5204, Vol. 16, no 1, p. 59-68Article in journal (Refereed)

The ongoing transition from uniprocessor to multi-core computers requires support from the operating system kernel. Although many general-purpose multiprocessor operating systems exist, there is a large number of specialized operating systems which require porting in order to work on multiprocessors. In this paper we describe the multiprocessor port of a cluster operating system kernel from a producer of industrial systems. Our initial implementation uses a giant locking scheme that serializes kernel execution. We also employed a method in which CPU-local variables are placed in a special section mapped to per-CPU physical memory pages. The giant lock and CPU-local section allowed us to implement an initial working version with only minor changes to the original code, although the giant lock and kernel-bound applications limit the performance of our multiprocessor port. Finally, we also discuss experiences from the implementation.

• 63. Kågström, Simon
A novel method for adding multiprocessor support to a large and complex uniprocessor kernel2004Conference paper (Refereed)

Summary form only given. The current trend of using multiprocessor computers for server applications requires operating system adoptions for high performance. However, modifying large bodies of software is very costly and time-consuming, and the cost of porting an operating system to a multiprocessor might not be motivated by the potential performance benefits. We present a novel method, the application kernel approach, for adaption of an existing uniprocessor kernel to SMP hardware. Our approach considers the existing kernel as a "black box", to which no or small changes are made. Instead, the original kernel runs OS-services unmodified on one processor whereas the other processors execute applications on top of a small custom kernel. A prototype implementation illustrates that the approach can be realized with fairly small resources. We also present an initial performance evaluation where we show that the performance is good for user-bound applications.

• 64. Kågström, Simon
The Application Kernel Approach: A Novel Approach for Adding SMP Support to Uniprocessor Operating Systems2006In: Software - Practice & Experience, ISSN 0038-0644 , Vol. 36, no 14, p. 1563-1583Article in journal (Refereed)
• 65.
Blekinge Institute of Technology, School of Engineering, Department of Mathematics and Natural Sciences.
Blekinge Institute of Technology, School of Engineering, Department of Mathematics and Natural Sciences. Blekinge Institute of Technology, School of Computing.
Optimal computer crash performance precaution2012In: Discrete Mathematics and Theoretical Computer Science, ISSN 1462-7264, Vol. 14, no 1, p. 55-68Article in journal (Refereed)

For a parallel computer system withmidentical computers, we study optimal performance precaution for one possible computer crash. We want to calculate the cost of crash precaution in the case of no crash. We thus define a tolerance level r meaning that we only tolerate that the completion time of a parallel program after a crash is at most a factor r + 1 larger than if we use optimal allocation on m - 1 computers. This is an r-dependent restriction of the set of allocations of a program. Then, what is the worst-case ratio of the optimal r-dependent completion time in the case of no crash and the unrestricted optimal completion time of the same parallel program? We denote the maximal ratio of completion times f(r, m) - i.e., the ratio for worst-case programs. In the paper we establish upper and lower bounds of the worst-case cost function f(r, m) and characterize worst-case programs.

An Optimal Execution Time Estimate of Static Versus Dynamic Allocation in Multiprocessor Systems1995In: SIAM journal on computing (Print), ISSN 0097-5397, E-ISSN 1095-7111, Vol. 24, no 4, p. 751-764Article in journal (Refereed)

Consider a multiprocessor with k identical processors, executing parallel programs consisting of n processes. Let T$-s$/(P) and T$-d$/(P) denote the execution times for the program P with optimal static and dynamic allocations, respectively, i.e., allocations giving minimal execution time. We derive a general and explicit formula for the following maximal execution time ratio: g(n, k) $EQ max T$-s$/(P)/T$-d$/(P), where the maximum is taken over all programs P consisting of n processes. Any interprocess dependency structure for the programs P is allowed only by avoiding deadlock. Overhead for synchronization and reallocation is neglected. Basic properties of the function g(n, k) are established, from which we obtain a global description of the function. Plots of g(n, k) are included. The results are obtained by investigating a mathematical formulation. The mathematical tools involved are essentially tools of elementary combinatorics. The formula is a combinatorial function applied on certain extremal matrices corresponding to extremal programs. It is mathematically complicated but rapidly computed for reasonable n and k, in contrast to the np-completeness of the problems of finding optimal allocations. • 67. Lennerstad, Håkan An Optimal Execution Time Estimate of Static versus Dynamic Allocation in Multiprocessor Systems1992Report (Other academic) Consider a multiprocessor with$k$identical processors, executing parallel programs consisting of$n$processes. Let$T_s(P)$and$T_d(P)$denote the execution times for the program$P$with optimal static and dynamic allocations respectively, i. e. allocations giving minimal execution time. We derive a general and explicit formula for the maximal execution time ratio$g(n,k)=\max T_s(P)/T_d(P)$, where the maximum is taken over all programs$P$consisting of$n$processes. Any interprocess dependency structure for the programs$P$is allowed, only avoiding deadlock. Overhead for synchronization and reallocation is neglected. Basic properties of the function$g(n,k)$are established, from which we obtain a global description of the function. Plots of$g(n,k)$are included. The results are obtained by investigating a mathematical formulation. The mathematical tools involved are essentially tools of elementary combinatorics. The formula is a combinatorial function applied on certain extremal matrices corresponding to extremal programs. It is mathematically complicated but rapidly computed for reasonable$n$and$k$, in contrast to the np-completeness of the problems of finding optimal allocations. • 68. Lennerstad, Håkan Combinatorics for multiprocessor scheduling optimization and other contexts in computer architecture1996Conference paper (Refereed) The method described consists of two steps. First, unnecessary programs are eliminated through a sequence of program transformations. Second, within the remaining set of programs, sometimes regarded as matrices, those where all possible combinations of synchronizations occur equally frequently are proven to be extremal. At this stage we obtain a formulation which is simple enough to allow explicit formulas to be derived. It turns out that the same method can be used for obtaining worst-case bounds on other NP-hard problems within computer architecture. • 69. Lennerstad, Håkan Decomposing rational numbers2010In: Acta Arithmetica, ISSN 0065-1036, E-ISSN 1730-6264, Vol. 145, no 3, p. 213-220Article in journal (Refereed) • 70. Lennerstad, Håkan En annan addition och Stern-Brocots träd2006In: Nämnaren : tidskrift för matematikundervisning, ISSN 0348-2723, no 3, p. 45-48Article in journal (Refereed) När två bråk adderas så adderar man bråkens täljare för sig och nämnare för sig. Så får man väl inte göra? Jodå, det får man, men inte med vanlig addition. Här får vi en glimt av vad följderna blir av denna annorlunda addition. • 71. Blekinge Institute of Technology, School of Engineering, Department of Mathematics and Natural Sciences. Blekinge Institute of Technology, School of Engineering, Department of Systems and Software Engineering. Generalizations of the floor and ceiling functions using the Stern-Brocot tree2006Report (Refereed) We consider a fundamental number theoretic problem where practial applications abound. We decompose any rational number a/b in c ratios as evenly as possible while maintaining the sum of numerators and the sum of denominators. The minimum and maximum of the ratios give rational estimates of a/b from below and from above. The case c=b gives the usual floor and ceiling functions. We furthermore define the max-min-difference, which is zero iff c≤GCD(a,b), quantifying the distance to relative primality. A main tool for investigating the properties of these quantities is the Stern-Brocot tree, where all positive rational numbers occur in lowest terms and in size order. We prove basic properties such that there is a unique decomposition that gives both the minimum and the maximum. It turns out that this decomposition contains at most three distinct ratios. The problem has arisen in a generalization of the 4/3-conjecture in computer science. • 72. Lennerstad, Håkan Guaranteeing Response Times for Aperiodic Tasks in Global Multiprocessor Scheduling2007In: Real-time systems, ISSN 0922-6443, E-ISSN 1573-1383, Vol. 35, no 2, p. 135-151Article in journal (Refereed) We provide a constant time schedulability test for an on-line multiprocessor server handling aperiodic tasks. Dhall's effect is avoided by dividing the tasks in two priority classes based on task utilization: heavy and light. We prove that if the load on the multiprocessor server stays below U threshold = 3 - root 7 approximately equals 35.425%, the server can accept an incoming aperiodic task and guarantee that the deadlines of all accepted tasks will be met. The same number 35.425% is also a threshold for a task to be characterized as heavy. The bound U threshold = 3 - root 7 approximately equals 35.425% is easy-to-use, but not sharp if we know the number of processors in the multiprocessor system. Assuming the server to be equipped with m processors, we calculate a formula for the sharp bound U threshold (m), which converges to U threshold from above as m -> infinity . The results are based on a utilization function u(x) = 2(1 - x)/(2 + root 2+2x). By using this function, the performance of the multiprocessor server can in some cases be improved beyond U threshold(m) by paying the extra overhead of monitoring the individual utilization of the current tasks. • 73. Lennerstad, Håkan Optimal combinatorial functions comparing multiprocess allocation performance in multiprocessor systems2000In: SIAM journal on computing (Print), ISSN 0097-5397, E-ISSN 1095-7111, p. 1816-1838Article in journal (Refereed) For the execution of an arbitrary parallel program P, consisting of a set of processes with any executable interprocess dependency structure, we consider two alternative multiprocessors. The first multiprocessor has q processors and allocates parallel programs dynamically; i.e., processes may be reallocated from one processor to another. The second employs cluster allocation with k clusters and u processors in each cluster: here processes may be reallocated within a cluster only. Let T-d(P, q) and T-c(P, k, u) be execution times for the parallel program P with optimal allocations. We derive a formula for the program independent performance function [GRAPHICS] Hence, with optimal allocations, the execution of P can never take more than a factor G(k, u, q) longer time with the second multiprocessor than with the first, and there exist programs showing that the bound is sharp. The supremum is taken over all parallel programs consisting of any number of processes. Overhead for synchronization and reallocation is neglected only. We further present a tight bound which exploits a priori knowledge of the class of parallel programs intended for the multiprocessors, thus resulting in a sharper bound. The function g(n, k, u, q) is the above maximum taken over all parallel programs consisting of n processes. The functions G and g can be used in various ways to obtain tight performance bounds, aiding in multiprocessor architecture decisions. • 74. Lennerstad, Håkan Optimal Combinatorial Functions Comparing Multiprocess Allocation Performance in Multiprocessor Systems1993Report (Other academic) For the execution of an arbitrary parallel program P, consisting of a set of processes, we consider two alternative multiprocessors. The first multiprocessor has q processors and allocates parallel programs dynamically, i.e. processes may be reallocated from one processor to another. The second employs cluster allocation with k clusters and u processors in each cluster - here processes may be reallocated within a cluster only. Let T_d(P,q) and T_c (P,k,u) be execution times for the parallel program P with optimal allocations. We derive a formula for the program independent performance function G(k,u,q)=\sup_ all parallel programs P T_c(P,k,u)}{T_d(P,q)}. Hence, with optimal allocations, the execution of$P$can never take more than a factor$G(k,u,q)$longer time with the second multiprocessor than with the first, and there exist programs showing that the bound is sharp. The supremum is taken over all parallel programs consisting of any number of processes. Any interprocess dependency structure is allowed for the parallel programs, except deadlock. Overhead for synchronization and reallocation is neglected only. We further present optimal formulas which exploits a priori knowledge of the class of parallel programs intended for the multiprocessor, thus resulting in sharper optimal bounds. The function g(n,k,u,q) is the above maximum taken over all parallel programs consisting of n processes. The function s(n,v,k,u) is the same maximum, with q=n, taken over all parallel programs of$n$processes which has a degree of parallelism characterized by a certain parallel profile vector v=(v_1,...,v_n). The functions can be used in various ways to obtain optimal performance bounds, aiding in multiprocessor architecture decisions. An immediate application is the evaluation of heuristic allocation algorithms. It is well known that the problems of finding the corresponding optimal allocations are NP-complete. We thus in effect present a methodology to obtain optimal control of NP-complete scheduling problems. • 75. Lennerstad, Håkan Optimal Computer Combinatorics2003Book (Other academic) • 76. Lennerstad, Håkan Optimal Scheduling Results for Parallel Computing1994In: SIAM news, ISSN 1557-9573, Vol. 27, no 7Article in journal (Refereed) The situation of having jobs of some kind which may be dependent, and having several possible executors of jobs, is a very common one and appears in very diverse contexts. Clearly some kind of scheduling is needed: who shall do what? Also one often wants to have it all done as quicky as possible. One central scheduling question is whether to allow transferring of jobs between executors. The antipoles here are allowing unlimited transferring, usually called dynamic allocation of jobs, and allowing no transferring, which is static allocation. • 77. Lennerstad, Håkan Optimal Scheduling Results for Parallel Computing1996In: Applications on advanced architecture computers / [ed] Astfalk, Greg, Philadelphia, USA: SIAM , 1996, p. 155-164Chapter in book (Refereed) Load balancing is one of many possible causes of poor performance on parallel machines. If good load balancing of the decomposed algorithm or data is not achieved, much of the potential gain of the parallel algorithm is lost to idle processors. Each of the two extremes of load balancing - static allocation and dynamic allocation - has advantages and disadvantages. This chapter illustrates the relationship between static and dynamic allocation of tasks. • 78. Lennerstad, Håkan Optimal Worst Case Formulas Comparing Cache Memory Associativity1995Report (Other academic) Consider an arbitrary program$P$which is to be executed on a computer with two alternative cache memories. The first cache is set associative or direct mapped. It has$k$sets and$u$blocks in each set, this is called a (k,u)$-cache. The other is a fully associative cache with $q$ blocks - a $(1,q)$-cache. We present formulas optimally comparing the performance of a $(k,u)$-cache compared to a $(1,q)$-cache for worst case programs. Optimal mappings of the program variables to the cache blocks are assumed. Let $h(P,k,u)$ denote the number of cache hits for the program $P$, when using a $(k,u)$-cache and an optimal mapping of the program variables of $P$ to the cache blocks. We establish an explicit formula for the quantity $$\inf_P \frac{h(P,k,u)}{h(P,1,q)},$$ where the infimum is taken over all programs $P$ which contain $n$ variables. The formula is a function of the parameters $n,k,u$ and $q$ only. We also deduce a formula for the infimum taken over all programs of any number of variables, this formula is a function of $k,u$ and $q$. We further prove that programs which are extremal for this minimum may have any hit ratio, i.e. any ratio $h(P,1,q)/m(P)$. Here $m(P)$ is the total number of memory references for the program P. We assume the commonly used LRU replacemant policy, that each variable can be stored in one memory block, and is free to be stored in any block. Since the problems of finding optimal mappings are NP-hard, the results provide optimal bounds for NP-hard quantities. The results on cache hits can easily be transformed to results on access times for different cache architectures.

Optimal worst case formulas comparing cache memory associativity2000In: SIAM journal on computing (Print), ISSN 0097-5397, E-ISSN 1095-7111, p. 872-905Article in journal (Refereed)

In this paper we derive a worst case formula comparing the number of cache hits for two different cache memories. From this various other bounds for cache memory performance may be derived. Consider an arbitrary program P which is to be executed on a computer with two alternative cache memories. The rst cache is set-associative or direct-mapped. It has k sets and u blocks in each set; this is called a (k, u)-cache. The other is a fully associative cache with q blocks-a (1, q)-cache. We derive an explicit formula for the ratio of the number of cache hits h(P, k, u) for a(k, u)-cache compared to a (1, q)-cache for a worst case program P. We assume that the mappings of the program variables to the cache blocks are optimal. The formula quantifies the ratio [GRAPHICS] where the in mum is taken over all programs P with n variables. The formula is a function of the parameters n, k, u, and q only. Note that the quantity h ( P, k, u) is NP-hard. We assume the commonly used LRU (least recently used) replacement policy, that each variable can be stored in one memory block, and that each variable is free to be mapped to any set. Since the bound is decreasing in the parameter n, it is an optimal bound for all programs with at most n variables. The formula for cache hits allows us to derive optimal bounds comparing the access times for cache memories. The formula also gives bounds ( these are not optimal, however) for any other replacement policy, for direct-mapped versus set-associative caches, and for programs with variables larger than the cache memory blocks.

• 80. Lundberg, Lars
Analyzing Fixed Priority Global Multiprocessor Scheduling.2002Conference paper (Refereed)

We consider a multiprocessor where hard real-time tasks are scheduled globally on m processors. Each task has a fixed priority and tasks are executed using preemptive scheduling. The state-of-the-art priority assignment scheme in such cases is called RM-US[US-LIMIT] [1], where US-LIMIT is a parameter to the RM-US algorithm. The challenge is to find the US-LIMIT that can gaurantee schedulability for as high utilization as possible. The previously best known US-LIMIT value could guarantee schedulability as long the multiprocessor utilization is below m/(3m-2), i.e. 0.33333 when m --> infinity. In this paper we define a new equation for US-LIMIT which quarantees schedulability for higher utilization values than the previous result. When m --> infinity we can now guarantee schedulability for all tasks sets when the multiprocessor utilization is below 0.37482. We also show that our US-LIMIT values are optimal, i.e. we show that there is no room for further improvement of this state-of-the-art priority assignment scheme.

• 81. Lundberg, Lars
Distributed High Performance Large Integer Arithmetic2002Conference paper (Refereed)

We have evaluated a number of techniques for obtaining distributed high performance arithmetic for large integers. Two main ideas are presented: a technique for handling carry propagation in parallel additions and a technique for distributing not only the processing but also the storage of very large integers onto a number of computers. These ideas have been compared to state-of-the-art arithmetic libraries. We have carried out performance evaluations on a Linux cluster with 32 computers and an SMP with eight processors. The performance of addition was improved by a factor 13, and that the method where the storage of an integer is distributed was superior to the approaches where only processing is distributed. The multiplication performance was improved by a factor of 7.

• 82. Lundberg, Lars
Evaluating the performance implications of binding threads to processors1997Conference paper (Refereed)
• 83. Lundberg, Lars
Fixed priority scheduling of age constraint processes1998Conference paper (Refereed)

Real-time systems often consist of a number of independent processes which operate under an age constraint. In such systems, the maximum time from the start process L-i in cycle k to the end in cycle k+1 must not exceed the age constraint A(i) for that process. The age constraint can be met by using fixed priority scheduling and periods equal to A(i)/2. However, this approach restricts the number of process sets which are schedulable. In this paper, we define a method for obtaining process periods other than A(i)/2. The periods are calculated in such a way that the age constraints are met. Our approach is better in the sense that a larger number of process sets can be scheduled compared to using periods equal to A(i)/2.

• 84. Lundberg, Lars
Multiprocessor performance evaluation of billing gateway systems for telecommunication applications1996Conference paper (Refereed)

This paper describes a performance evaluation of different implementation alternatives of a parallel C++ program for telecommunication applications. The implementation alternatives use different combinations of Solaris threads and Unix processes. A Sun multiprocessor with eight processors was used in the performance evaluation. The measurements show that the multiprocessor system favors implementations with many Unix processes, containing a small number of threads each, compared to implementations with one process containing a large number of threads. It also turns out that the performance can be increased by binding threads to processors, and that the overhead grows significantly when the number of processors increases.

• 85.
Blekinge Institute of Technology, Department of Software Engineering and Computer Science.
Predicting and bounding the speedup of multithreaded Solaris programs1999In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, p. 322-333Article in journal (Refereed)

In Solaris, threads are frequently relocated. The data associated with a relocated thread have to be moved from the cache of the old processor to the new processor. In order to avoid poor memory performance due to thread relocation, threads can be bound to processors-static scheduling. Finding a static schedule which results in maximum speedup is NP-hard. It is even difficult to determine if a static schedule is close to the optimal case or not. Here, a technique for predicting the speedup of multithreaded Solaris programs is presented. Based on an existing theoretical result, a lower bound on the maximal speedup is also obtained. The predicted speedup and the bound are based on recordings from a single-processor execution. When comparing the predictions with the real speedup using a multiprocessor with eight processors, we see that the predictions are very good. By comparing the speedup of a static schedule with the bound, we see that it is worthwhile to look for other schedules. (C) 1999 Academic Press.

• 86. Lundberg, Lars
Slack-based multiprocessor scheduling of aperiodic real-time tasks2011In: Real-time systems, ISSN 0922-6443, E-ISSN 1573-1383, Vol. 47, no 6, p. 618-638Article in journal (Refereed)

We provide a constant time schedulability test and priority assignment algorithm for an on-line multiprocessor server handling aperiodic tasks. The so called Dhall's effect is avoided by dividing tasks in two priority classes based on their utilization: heavy and light. The improvement in this paper is due to assigning priority of light tasks based on slack-not on deadlines. We prove that if the load on the multiprocessor stays below (3 - √5)/2 ≈ 38.197%, the server can accept an incoming aperiodic task and guarantee that the deadlines of all accepted tasks will be met. This is better than the current state-of-the-art algorithm where the priorities of light tasks are based on deadlines (the corresponding bound is in that case 35.425%). The bound (3 - √5)/ 2 can be improved if the number of processors m is known. There is a formula for the sharp bound Uthreshold(m) = 3m - 2 - √5m 2 - 8m + 4/2(m - 1), which converges to (3 - √5)/2 from above as m→∞. For m≥3, the bound is higher (i.e., better) than the corresponding sharp bound for the state-of-the-art algorithm where the priorities of light tasks are based on deadlines. A simulation study also indicates that when m>3 the best effort behavior of the priority assignment scheme suggested here is better than that of the traditional scheme where priorities are based on deadlines.

• 87. Lundberg, Lars
Utilization based schedulability bounds for age constraint process sets in real-time systems2002In: Real-time systems, ISSN 0922-6443, E-ISSN 1573-1383, p. 273-295Article in journal (Refereed)

Some real-time systems consist of a number of processes that operate under age constraints. In such systems, the maximum time from the start of process L-i in cycle k to the end in cycle k+1 must not exceed the age constraint A(i) for that process. The age constraint can be met by using fixed priority scheduling and periods equal to A(i)/2. However, this approach restricts the number of process sets which are schedulable. In this paper, we define a method for obtaining process periods other than A(i)/2. The periods are calculated in such a way that the age constraints are met. Our approach is better in the sense that a larger number of process sets can be scheduled compared to using periods equal to A(i)/2. The main results in this paper are a number of performance bounds on age constraint processes. These bounds show that there is a significant gain in worst case as well as in best case behavior by using periods other than A(i)/2, particularly when there are a large number of processes in the system.

• 88. Lundberg, Lars
Evaluating Heuristic Scheduling Algorithms for High Performance Parallel Processing2003Conference paper (Refereed)

Most cluster systems used in high performance computing do not allow process relocation at run-time. Finding an allocation that results in minimal completion time is NP-hard and (non-optimal) heuristic algorithms have to be used. One major drawback with heuristics is that we do not know if the result is close to optimal or not. Here, we present a method for finding an upper bound on the minimal completion time for a given program. The bound helps the user to determine when it is worth-while to continue the heuristic search for better allocations. Based on some parameters derived from the program, as well as some parameters describing the hardware platform, the method produces the minimal completion time bound. A practical demonstration of the method is presented using a tool that produces the bound.

• 89.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science and Engineering.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science and Engineering. Blekinge Institute of Technology, Faculty of Computing, Department of Communication Systems. Compuverde AB.
Cache Support in a High Performance Fault-Tolerant Distributed Storage System for Cloud and Big Data2015In: 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IEEE Computer Society, 2015, p. 537-546Conference paper (Refereed)

Due to the trends towards Big Data and Cloud Computing, one would like to provide large storage systems that are accessible by many servers. A shared storage can, however, become a performance bottleneck and a single-point of failure. Distributed storage systems provide a shared storage to the outside world, but internally they consist of a network of servers and disks, thus avoiding the performance bottleneck and single-point of failure problems. We introduce a cache in a distributed storage system. The cache system must be fault tolerant so that no data is lost in case of a hardware failure. This requirement excludes the use of the common write-invalidate cache consistency protocols. The cache is implemented and evaluated in two steps. The first step focuses on design decisions that improve the performance when only one server uses the same file. In the second step we extend the cache with features that focus on the case when more than one server access the same file. The cache improves the throughput significantly compared to having no cache. The two-step evaluation approach makes it possible to quantify how different design decisions affect the performance of different use cases.

• 90. Lundberg, Lars
Recovery Schemes for High Availability and High Performance Distributed Real-Time Computing2003Conference paper (Refereed)

Clusters and distributed systems offer fault tolerance and high performance through load sharing, and are thus attractive in real-time applications. When all computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers-fail the must be redistributed. The redistribution is determined by the recovery scheme. The recovery scheme should keep the load as evenly distributed as possible even when the most unfavorable combinations of computers break down, i.e. we want to optimize the worst-case behavior. In this paper we define recovery schemes, which are optimal for a number of important cases. We also show that the problem of finding optimal recovery schemes corresponds to the mathematical problem of finding sequences of integers with minimal sum and for which all sums of subsequences are unique.

• 91. Lundberg, Lars
Comparing the Optimal Performance of Multiprocessor Architectures2003Conference paper (Refereed)

Consider a parallel program with n processes and a synchronization granularity z. Consider also two multiprocessors: a multiprocessor with q processors and run-time reallocation of processes to processors, and a multiprocessor with k processors and no run-time reallocation. There is an inter processor communication delay of t time units for the system with no run-time reallocation. In this paper we define a function g(n,k,q,t,z) such that the minimum completion time for all programs with n processes and a granularity z is at most g(n,k,q,t,z) times longer using the system with no reallocation and k processors compared to using the system with q processors and run-time reallocation. We assume optimal allocation and scheduling of processes to processors. The function g(n,k,q,t,z) is optimal in the sense that there is at least one program, with n processes and a granularity z, such that the ratio is exactly g(n,k,q,t,z). We also validate our results using measurements on distributed and multiprocessor Sun/Solaris environments.

• 92. Lundberg, Lars
An Optimal Lower Bound on the Maximal Speedup in Multiprocessors with Clusters1995Conference paper (Refereed)

We consider an ideal multiprocessor system with q processors and a centralized scheduler without overhead that select processes from one common pool, permitting dynamic relocation of processes. A parallel program P consisting of n processes is executed on this system and terminates when all processes are completed. Due to synchronizations, processes may be blocked while waiting for events in other processes. The parallel program is executed using some schedule of processes to processors, resulting in a speedup $sigma@. We then consider an ideal multiprocessor with k clusters containing u processors each. In this system processes may not be relocated between clusters. Finding a schedule which results in maximum speedup is NP-hard. Here, we present a formula for the optimal lower bound on the maximum speedup for program P, as a function of q, n,$sigma@, k and u. We also present a formula for the optimal lower bound when the number of processes (n) is unknown. Using these results we are able to decide if a certain schedule is close to optimal or if it is worth-while to look for other schedules. This is demonstrated by evaluating the speedup of a specific schedule of a particular program.

• 93. Lundberg, Lars
An Optimal Upper Bound on the Minimal Completion Time in Distributed Supercomputing1994Conference paper (Refereed)

We first consider an MIMD multiprocessor configuration with n processors. A parallel program, consisting of n processes, is executed on this system-one process per processor. The program terminates when all processes are completed. Due to synchronizations, processes may be blocked waiting for events in other processes. Associated with the program is a parallel profile vector nu , index i (1<or=i<or=n) in this vector indicates the percentage of the total execution time when i processes are executing. We then consider a distributed MIMD supercomputer with k clusters, containing u processors each. The same parallel program, consisting of n processes, is executed on this system. Each process can only be executed by processors in the same cluster. Finding a schedule with minimal completion time in this case is NP-hard. We are interested in the gain of using n processors compared to using k clusters containing u processors each. The gain is defined by the ratio between the minimal completion time using processor clusters and the completion time using a schedule with one process per processor. We present the optimal upper bound for this ratio in the form of an analytical expression in n, nu , k and u. We also demonstrate how this result can be used when evaluating heuristic scheduling algorithms (12 Refs.)

• 94. Lundberg, Lars
Bounding the gain of changing the number of memory modules in shared memory multiprocessors1997In: Nordic Journal of Computing, ISSN 1236-6064, Vol. 4, no 3, p. 233-58Article in journal (Refereed)

We consider a multiprocessor, with p processors and m memory modules. If more than one processor want to access a memory module simultaneously, these accesses will be serialized due to memory contention. The completion time of a program executing on this system is thus affected by the way addresses are mapped onto memory modules, and finding a mapping which results in minimal completion time is NP-hard. If we change the number of memory modules from m to m’, while keeping the number processors constant, we will generally change the minimal completion time. The gain of changing the number of memory modules from m to m’ for a certain program is defined as the ratio between the minimal completion times, using m and m’ modules respectively. Here, we present an optimal upper bound on the maximum gain of changing the number of memory modules, as a function of m, m’ and p, including the case when m’ is infinitely large. The bound is obtained by investigating a mathematical formulation. The mathematical tools involved are essentially elementary combinatorics. The formula for calculating the bound is mathematically complicated but can be rapidly computed for reasonable m, m’ and p. This bound makes it possible to do price-performance trade-offs and to compare any mapping of addresses to memory modules with the optimal case. The results are applicable to different multiprocessor architectures, e.g. systems with crossbar networks and systems with multiple buses. The results also make it possible to calculate optimal performance bounds for multiprocessors containing cache memories: as well as for multiprocessors with no cache memories. Moreover, we discuss how the results can be used for calculating bounds for programs with as well as without synchronizations.

• 95. Lundberg, Lars
Combinatorics for Scheduling Optimization1996Conference paper (Refereed)
• 96. Lundberg, Lars
Comparing the optimal performance of different MIMD multiprocessor architectures1998Conference paper (Refereed)

We compare the performance of systems consisting of one large cluster containing q processors with systems where processors are grouped into k clusters containing u processors each. A parallel program, consisting of n processes, is executed on this system. Processes may be relocated between the processors in a cluster. They may,however not be relocated from one cluster to another. The performance criterion is the completion time of the parallel program. We present two functions: g(n,k,u,q) and G(k,u,q). Provided that we can find optimal or near optimal schedules,these functions put optimal upper bounds on the gain of using one cluster containing q processors compared to using k clusters containing u processors each. The function g(n,k,u,q) is valid for programs with n processes, whereas G(k,u,q) only depends on the two multiprocessor architectures. By evaluating g(n,k,u,q) and G(k,u,q) we show that the gain of increasing the cluster size from 1 to 2 and from 2 to 4 is relatively large. However, the gain of using clusters larger than 4 is very limited.

• 97. Lundberg, Lars
Global multiprocessor scheduling of aperiodic tasks using time-independent priorities2003Conference paper (Refereed)

We provide a constant time schedulability test for a multiprocessor server handling aperiodic tasks. Dhall's effect is avoided by dividing the tasks in two priority classes based on task utilization: heavy and light. We prove that if the load on the multiprocessor server stays below U-threshold = 3 - root7 = 35.425%, the server can accept incoming aperiodic tasks and guarantee that the deadlines of all accepted tasks will be met. 35.425% utilization is also a threshold for a task to be characterized as heavy. The bound U-threshold = 3 - root7 approximate to 35.425% is easy-to-use, but not sharp if we know the number of processors in the multiprocessor. For a server with m processors, we calculate a formula for the sharp bound U-threshold(m), which converges to U-threshold from above as m --> infinity. The results are based on a utilization function u(m)(x) = 2(1-x)/(2+ root2+2x)+x/m. By using this function, the performance of the multiprocessor can in some cases be improved beyond U-threshold(m) by paying the extra overhead of monitoring the individual utilization of the current tasks.

• 98. Lundberg, Lars
Optimal bounds on the gain of permitting dynamic allocation of communication channels in distributed computing1999In: Acta Informatica, ISSN 0001-5903, E-ISSN 1432-0525, p. 425-446Article in journal (Refereed)

Consider a distributed system consisting of n computers connected by a number of identical broadcast channels. All computers may receive messages from all channels. We distinguish between two kinds of systems: systems in which the computers may send on any channel (dynamic allocation) and system where the send port of each computer is statically allocated to a particular channel. A distributed task (application) is executed on the distributed system. A task performs execution as well as communication between its subtasks. We compare the completion time of the communication for such a task using dynamic allocation and k(d) channels with the completion time using static allocation and k(s) channels. Some distributed tasks will benefit very much from allowing dynamic allocation, whereas others will work fine with static allocation. In this paper we define optimal upper and lower bounds on the gain (or loss) of using dynamic allocation and k(d) channels compared to static allocation and k(s) channels. Our results show that, for some tasks, the gain of permitting dynamic allocation is substantial, e.g. when k(s) = k(d) = 3, there are tasks which will complete 1.89 times faster using dynamic allocation compared to using the best possible static allocation, but there are no tasks with a higher such ratio.

• 99. Lundberg, Lars
Slack-based Global Multiprocessor Scheduling of Aperiodic Tasks in Parallel Embedded Real-time Systems2008Conference paper (Refereed)

We provide a constant time schedulability test and priority assignment algorithm for an on-line multiprocessor server handling aperiodic tasks. Dhall's effect is avoided by dividing tasks in two priority classes based on their utilization: heavy and light. The improvement in this paper is due to assigning priority of light tasks based on slack - not on deadlines. We prove that if the load on the multiprocessor stays below (3 - √5)/2 ≈ 38.197%, the server can accept an incoming aperiodic task and guarantee that the deadlines of all accepted tasks will be met. This is better than the current state-of-the-art algorithm where the priorities of light tasks are based on deadlines (the corresponding bound is in that case 35.425%). ©2008 IEEE.

• 100. Lundberg, Lars
Using Recorded Values for Bounding the Minimum Completion Time in Multiprocessors1998In: IEEE Transactions on Parallel and Distributed Systems, ISSN 1045-9219, E-ISSN 1558-2183, Vol. 9, no 4Article in journal (Refereed)

The way the processes in a parallel program are scheduled on the processors of a multiprocessor system affects the performance significantly. Finding a schedule of processes to processors which results in minimum completion time is NP-hard. Therefore, one has to resort to heuristic schedules. However, it is often difficult to determine if a specific schedule is close to the optimal case or if it is worthwhile to look for other schedules. Based on information from previous executions of the parallel program, we present a formula for an upper bound on the minimum completion time of the program. The bound is a function of a set of parameters. Some of these parameters are obtained from the previous executions of the program and the others describe the target multiprocessor architecture for which we want to bound the minimum completion time. The bound is optimal when it is based on information from one previous execution. Using these results, we are able to decide if a certain schedule is close to optimal or if it is worthwhile to look for other schedules. This is demonstrated by evaluating the completion time of a specific schedule of a particular program. The proofs used for obtaining the bound are based on program transformations and combinatorial mathematics.

1234 51 - 100 of 157
Cite
Citation style
• apa
• harvard1
• ieee
• modern-language-association-8th-edition
• vancouver
• Other style
More styles
Language
• de-DE
• en-GB
• en-US
• fi-FI
• nn-NO
• nn-NB
• sv-SE
• Other locale
More languages
Output format
• html
• text
• asciidoc
• rtf