The papers below are subject to ACM, IEEE, or other copyrights as noted in the paper's text

J.A. Clemente, J. Resano and D. Mozos (2014), "An Approach to Manage Reconfigurations and Reduce Area Cost in Hard Real-Time Reconfigurable Systems", ACM Transactions on Embedded Computing Systems., February, 2014. Vol. 13(4)
BibTeX:
@article{Clemente2014ManageReconfigurations,
  author = {Juan Antonio Clemente and Javier Resano and Daniel Mozos},
  title = {An Approach to Manage Reconfigurations and Reduce Area Cost in Hard Real-Time Reconfigurable Systems},
  journal = {ACM Transactions on Embedded Computing Systems},
  year = {2014},
  volume = {13},
  number = {4}
}
A. Ferreron, D. Suarez, J. Alastruey, T. Monreal and V. Viñals (2014), "Block Disabling Characterization and Improvements in CMPs Operating at Ultra-low Voltages", In 26th Int. Symp. on Computer Architecture and High Performance Computing (SBAC-PAD 2014).
Abstract: Power density has become the limiting factor in technology scaling as power budget limits the amount of hardware that can be active at the same time. Reducing supply voltage to ultra-low voltage ranges close to the threshold region has the promise of great energy savings. However, the potential savings of voltage scaling become limited by the correct operation of SRAM cells, which is not guaranteed below Vddmin, the minimum voltage in which cache structures operate reliably.

Understanding the effects of operating below Vddmin requires complex modeling, so we introduce an updated probability failure model of SRAM cells at 22nm and explore the reliability impact of lowering the chip voltage supply below Vddmin in shared-memory coherent chip-multiprocessors (CMP) running a variety of parallel workloads. A microarchitectural technique to cope with cache reliability at ultra-low voltages is block disabling; however, in many cases, the savings in on-chip caches do not compensate for the consumption in the rest of the system, as the consumption increase of the off-chip memory may offset the on-chip gain.

We make the case that existing coherence mechanisms can provide the substrate to improve energy savings with block disabling and propose two low-complexity techniques. Taking the best of both techniques we can scale voltage below Vddmin and reduce system energy up to 39%, and system energy-delay up to 10%. Besides, by lowering the CMP consumption in a power-constrained scenario, we could activate offline cores, reaching a potential speedup between 3.7 and 4.4.

BibTeX:
@inproceedings{Ferreron2014,
  author = {Alexandra Ferreron and Dario Suarez and Jesus Alastruey and Teresa Monreal and Victor Viñals},
  title = {Block Disabling Characterization and Improvements in CMPs Operating at Ultra-low Voltages},
  booktitle = {26th Int. Symp. on Computer Architecture and High Performance Computing (SBAC-PAD 2014)},
  year = {2014}
}
J. Albericio, P. Ibáñez, V. Viñals and J.M. Llabería (2013), "Exploiting reuse locality on inclusive shared last-level caches", ACM Transactions on Architecture and Code Optimization., January, 2013. Vol. 9-4, pp. 38:1-38:19. ACM.
BibTeX:
@article{Albericio:2013:ERL:2400682.2400697,
  author = {Albericio, Jorge and Ibáñez, Pablo and Viñals, Víctor and Llabería, Jose María},
  title = {Exploiting reuse locality on inclusive shared last-level caches},
  journal = {ACM Transactions on Architecture and Code Optimization},
  publisher = {ACM},
  year = {2013},
  volume = {9-4},
  pages = {38:1--38:19},
  doi = {10.1145/2400682.2400697}
}
A. Ferrerón-Labari, M. Ortín-Obón, D. Suárez-Gracia, J. Alastruey and V. Viñals-Yúfera (2013), "Shrinking L1 Instruction Caches to Improve Energy-Delay in SMT Embedded Processors", In Proceedings of the 26th International Conference on Architecture of Computing Systems (ARCS 2013)., February, 2013, pp. 256-267. Springer Berlin / Heidelberg.
BibTeX:
@inproceedings{Ferreron2013,
  author = {Alexandra Ferrerón-Labari and Marta Ortín-Obón and Darío Suárez-Gracia and Jesús Alastruey and Víctor Viñals-Yúfera},
  title = {Shrinking L1 Instruction Caches to Improve Energy-Delay in SMT Embedded Processors},
  booktitle = {Proceedings of the 26th International Conference on Architecture of Computing Systems (ARCS 2013)},
  publisher = {Springer Berlin / Heidelberg},
  year = {2013},
  pages = {256--267}
}
C. González, S. Sánchez, A. Paz, J. Resano, D. Mozos and A. Plaza (2013), "Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing", Integration, the VLSI Journal. Vol. 46-2, pp. 89-103.
BibTeX:
@article{Gonzalez13FPGAandGPU,
  author = {C. González and S. Sánchez and A. Paz and J. Resano and D. Mozos and A. Plaza},
  title = {Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing},
  journal = {Integration, the VLSI Journal},
  year = {2013},
  volume = {46-2},
  pages = {89--103},
  doi = {10.1016/j.vlsi.2012.04.002}
}
R. Gran, J. Segarra, C. Rodriguez, L.C. Aparicio and V. Viñals (2013), "Optimizing a combined WCET-WCEC problem in instruction fetching for real-time systems", Journal of Systems Architecture., October, 2013. Vol. 59(9), pp. 667-678. Elsevier.
BibTeX:
@article{Gran13Optimizing,
  author = {R. Gran and J. Segarra and C. Rodriguez and L. C. Aparicio and V. Viñals},
  title = {Optimizing a combined WCET-WCEC problem in instruction fetching for real-time systems},
  journal = {Journal of Systems Architecture},
  publisher = {Elsevier},
  year = {2013},
  volume = {59},
  number = {9},
  pages = {667--678},
  doi = {10.1016/j.sysarc.2013.07.012}
}
S. López, T. Vladimirova, C. González, J. Resano, D. Mozos and A. Plaza (2013), "The Promise of Reconfigurable Computing for Hyperspectral Imaging On-Board Systems: Review and Trends", Proceedings of the IEEE., March, 2013. Vol. 101-3, pp. 698-722.
BibTeX:
@article{Lopez13Promise,
  author = {S. López and T. Vladimirova and C. González and J. Resano and D. Mozos and A. Plaza},
  title = {The Promise of Reconfigurable Computing for Hyperspectral Imaging On-Board Systems: Review and Trends},
  journal = {Proceedings of the IEEE},
  year = {2013},
  volume = {101-3},
  pages = {698--722}
}
J. Olivito, C. González and J. Resano (2013), "An FPGA-based specific processor for Blokus-Duo", In International Conference on Field-Programmable Technology 2013. Kyoto, Japan, December, 2013, pp. 502-505.
BibTeX:
@inproceedings{Olivito2013BlokusDuo,
  author = {Olivito, J. and González, C. and Resano, J.},
  title = {An FPGA-based specific processor for Blokus-Duo},
  booktitle = {International Conference on Field-Programmable Technology 2013},
  year = {2013},
  pages = {502--505}
}
M.J.R. Ortiga, E.L. Pueyo, A. Rodríguez-Pintó, L.H. Ros, A. Pocoví, J.L. Briz and J.C. Ciria (2013), "A computed tomography approach for understanding 3D deformation patterns in complex folds", Tectonophysics., May, 2013. Vol. 593, pp. 57-72. Elsevier.
BibTeX:
@article{Ortiga13tomography,
  author = {María José Ramón Ortiga and Emilio Luis Pueyo and Adriana Rodríguez-Pintó and Luis Humberto Ros and Andrés Pocoví and José Luis Briz and José Carlos Ciria},
  title = {A computed tomography approach for understanding 3D deformation patterns in complex folds},
  journal = {Tectonophysics},
  publisher = {Elsevier},
  year = {2013},
  volume = {593},
  pages = {57--72}
}
M. Ortín, A. Ferrerón, J. Albericio, D. Suárez, M. Villarroya-Gaudó, C. Izu and V. Viñals (2013), "Characterization and cost-efficient selection of NoC topologies for general purpose CMPs", In Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip. New York, NY, USA, January, 2013, pp. 21-24. ACM.
BibTeX:
@inproceedings{Ortin2013Topologies,
  author = {Ortín, Marta and Ferrerón, Alexandra and Albericio, Jorge and Suárez, Darío and Villarroya-Gaudó, María and Izu, Cruz and Viñals, Víctor},
  title = {Characterization and cost-efficient selection of NoC topologies for general purpose CMPs},
  booktitle = {Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip},
  publisher = {ACM},
  year = {2013},
  pages = {21--24},
  url = {http://doi.acm.org/10.1145/2482759.2482765},
  doi = {10.1145/2482759.2482765}
}
A. Sankaranarayanan, E. Ardestani, J.L. Briz and J. Renau (2013), "An Energy Efficient GPGPU Memory Hierarchy With Tiny Incoherent Caches", In ISLPED 2013. Beijing, September, 2013.
BibTeX:
@inproceedings{Sankaranarayanan13Energy,
  author = {A. Sankaranarayanan and E. Ardestani and J. L. Briz and J. Renau},
  title = {An Energy Efficient GPGPU Memory Hierarchy With Tiny Incoherent Caches},
  booktitle = {ISLPED 2013},
  year = {2013}
}
J. Albericio, P. Ibáñez, R. Gran, V. Viñals and J. Llabería (2012), "ABS: a Low-Cost Adaptive Controller for Prefetching in a Banked Shared Last-Level Cache", ACM Transactions on Architecture and Code Optimization., January, 2012. Vol. 8-4, pp. 19.1-19.20. ACM.
Abstract: Hardware data prefetch is a very well known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main memory bandwidth. This may degrade the performance of other cores and even the overall system performance unless the prefetch aggressiveness of each core is controlled from a system standpoint. On the other hand, LLCs in commercial chip multiprocessors are more and more frequently organized in independent banks. In this contribution, we target for the first time prefetch in a banked LLC organization and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. The ABS controller operation repeats at fixed time intervals (epochs). In each epoch a single core is selected and its prefetch aggressiveness is changed following a previous trend. At the end of the epoch a global performance index is evaluated and, depending on the improvement observed against a reference epoch, the tested change is maintained or undone. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases 22% and Memory Bandwidth Consumption decreases 14%) over an eight-core baseline system that uses aggressive sequential prefetch with a fixed degree. Similar conclusions can be drawn by varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
BibTeX:
@article{Albericio2012,
  author = {J. Albericio and P. Ibáñez and R. Gran and V. Viñals and J.M. Llabería},
  title = {ABS: a Low-Cost Adaptive Controller for Prefetching in a Banked Shared Last-Level Cache},
  journal = {ACM Transactions on Architecture and Code Optimization},
  publisher = {ACM},
  year = {2012},
  volume = {8-4},
  pages = {19.1-19.20},
  doi = {10.1145/2086696.2086698}
}
P. García-Risueño and P.E. Ibáñez (2012), "A review of High Performance Computing foundations for scientists", International Journal of Modern Physics C., May, 2012. Vol. 23(7), pp. 33.
Abstract: The increase of existing computational capabilities has made simulation emerge as a third discipline of Science, lying midway between experimental and purely theoretical branches [1, 2]. Simulation enables the evaluation of quantities which otherwise would not be accessible, helps to improve experiments and provides new insights on systems which are analysed [3-6]. Knowing the fundamentals of computation can be very useful for scientists, for it can help them to improve the performance of their theoretical models and simulations. This review includes some technical essentials that can be useful to this end, and it is devised as a complement for researchers whose education is focused on scientific issues and not on technological respects. In this document we attempt to discuss the fundamentals of High Performance Computing (HPC) [7] in a way which is easy to understand without much previous background. We sketch the way standard computers and supercomputers work, as well as discuss distributed computing and discuss essential aspects to take into account when running scientific calculations in computers.
BibTeX:
@article{Garcia-Risueno2012,
  author = {Pablo García-Risueño and Pablo E. Ibáñez},
  title = {A review of High Performance Computing foundations for scientists},
  journal = {International Journal of Modern Physics C},
  year = {2012},
  volume = {23},
  number = {7},
  pages = {33},
  doi = {10.1142/S0129183112300011}
}
X. Qian, B. Sahelices and J. Torrellas (2012), "BulkSMT: Designing SMT Processors for Atomic-Block Execution", In International Symposium on High Performance Computer Architecture (HPCA 2012). New Orleans, Louisiana, February, 2012.
BibTeX:
@inproceedings{Qian2012BulkSMT,
  author = {Xuehai Qian and Benjamin Sahelices and Josep Torrellas},
  title = {BulkSMT: Designing SMT Processors for Atomic-Block Execution},
  booktitle = {International Symposium on High Performance Computer Architecture (HPCA 2012)},
  year = {2012},
  note = {(accepted)}
}
C. González, J. Resano, D. Mozos, A. Plaza and D. Valencia (2012), "FPGA Implementation of Abundance Estimation for Spectral Unmixing of Hyperspectral Data Using the Image Space Reconstruction Algorithm", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. Vol. 5(1), pp. 248-261. IEEE.
BibTeX:
@article{Resano2012Abundance,
  author = {Carlos González and Javier Resano and Daniel Mozos and Antonio Plaza and David Valencia},
  title = {FPGA Implementation of Abundance Estimation for Spectral Unmixing of Hyperspectral Data Using the Image Space Reconstruction Algorithm},
  journal = {IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
  publisher = {IEEE},
  year = {2012},
  volume = {5},
  number = {1},
  pages = {248--261},
  doi = {10.1109/JSTARS.2011.2171673}
}
C. González, J. Resano, D. Mozos, A. Plaza and D. Valencia (2012), "FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis", IEEE Transactions on Geoscience and Remote Sensing. Vol. 50(2), pp. 374-388. IEEE.
BibTeX:
@article{Resano2012PixelPurity,
  author = {Carlos González and Javier Resano and Daniel Mozos and Antonio Plaza and David Valencia},
  title = {FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  publisher = {IEEE},
  year = {2012},
  volume = {50},
  number = {2},
  pages = {374--388},
  doi = {10.1109/TVLSI.2010.2050158}
}
B. Sahelices, A. de Dios, P. Ibáñez, V. Viñals and J. Llabería (2012), "Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers", Journal of Computer Science and Technology. Vol. 27(1), pp. 75-91. Science Press.
Abstract: Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
BibTeX:
@article{Sahelices2012,
  author = {B. Sahelices and A. de Dios and P. Ibáñez and V. Viñals and J.M. Llabería},
  title = {Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers},
  journal = {Journal of Computer Science and Technology},
  publisher = {Science Press},
  year = {2012},
  volume = {27},
  number = {1},
  pages = {75--91},
  doi = {10.1007/s11390-012-1207-2}
}
J. Segarra, C. Rodríguez, R. Gran, L.C. Aparicio and V. Viñals (2012), "A small and effective data cache for real-time multitasking systems", In IEEE Real-Time and Embedded Technology and Applications Symposium. Beijing, China, April, 2012, pp. 45-54. IEEE Computer Society Press.
Abstract: In multitasking real-time systems, the WCET of each task and also the effects of interferences between tasks in the worst-case scenario need to be calculated. This is especially complex with data caches. In this paper, we propose a small instruction-driven data cache (256 bytes) that effectively exploits locality. It works by preselecting a subset of memory instructions that will have data cache replacement permission. Selection of such instructions is based on data reuse theory. Since each selected memory instruction replaces its own data cache line, it prevents pollution and performance in tasks becomes independent of the size of the associated data structures. We have modeled several memory configurations using the Lock-MS WCET analysis method. Our results show that, on average, our data cache effectively services 88% of program data. Such results translate into doubling the performance of the tested real-time multitasking experiments, which (increasing from 75 to 89%) approaches the ideal case of always hitting in instruction and data caches. Additionally, we show that using partitioning on our proposed hardware only provides marginal benefits.
BibTeX:
@inproceedings{Segarra12Small,
  author = {J. Segarra and C. Rodríguez and R. Gran and L. C. Aparicio and V. Viñals},
  title = {A small and effective data cache for real-time multitasking systems},
  booktitle = {IEEE Real-Time and Embedded Technology and Applications Symposium},
  publisher = {IEEE Computer Society Press},
  year = {2012},
  pages = {45--54},
  doi = {10.1109/RTAS.2012.11}
}
M.A. Montañés, E. Torres, J. Martínez-Rincón and J.E. Herrero-Jaraba (2012), "Real-Time GPU Color-Based Segmentation of Football Players", Journal of Real-Time Image Processing., December, 2012. Vol. 7, pp. 267-279. Springer.
Abstract: In this paper, we propose a multi camera application capable of processing high resolution images and extracting features based on colors patterns over graphic processing units (GPU). The goal is to work in real time under the uncontrolled environment of a sport event like a football match. Since football players are composed for diverse and complex color patterns, a Gaussian Mixture Models (GMM) is applied as segmentation paradigm, in order to analyze sport live images and video. Optimization techniques have also been applied over the C++ implementation using profiling tools focused on high performance. Time consuming tasks were implemented over NVIDIA's CUDA platform, and later restructured and enhanced, speeding up the whole process significantly. Our resulting code is around 4-11 times faster on a low cost GPU than a highly optimized C++ version on a central processing unit (CPU) over the same data. Real time has been obtained processing until 64 frames per second. An important conclusion derived of our study is the scalability of the application to the number of cores on the GPU.
BibTeX:
@article{Torres2012,
  author = {M. A. Montañés and E. Torres and J. Martínez-Rincón and J. E. Herrero-Jaraba},
  title = {Real-Time GPU Color-Based Segmentation of Football Players},
  journal = {Journal of Real-Time Image Processing},
  publisher = {Springer},
  year = {2012},
  volume = {7},
  pages = {267--279},
  doi = {10.1007/s11554-011-0194-9}
}
L.C. Aparicio, J. Segarra, C. Rodríguez and V. Viñals (2011), "Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems", Journal of Systems Architecture., August, 2011. Vol. 57(7), pp. 695-706. Elsevier.
Abstract: In multitasking real-time systems it is required to compute the WCET of each task and also the effects of interferences between tasks in the worst case. This is very complex with variable latency hardware, such as instruction cache memories, or, to a lesser extent, the line buffers usually found in the fetch path of commercial processors. Some methods disable cache replacement so that it is easier to model the cache behavior. The difficulty in these cache-locking methods lies in obtaining a good selection of the memory lines to be locked into cache. In this paper, we propose an ILP-based method to select the best lines to be loaded and locked into the instruction cache at each context switch (dynamic locking), taking into account both intra-task and inter-task interferences, and we compare it with static locking. Our results show that, without cache, the spatial locality captured by a line buffer doubles the performance of the processor. When adding a lockable instruction cache, dynamic locking systems are schedulable with a cache size between 12.5% and 50% of the cache size required by static locking. Additionally, the computation time of our analysis method is not dependent on the number of possible paths in the task. This allows us to analyze large codes in a relatively short time (100 KB with 10^65 paths in less than 3 min).
BibTeX:
@article{Aparicio10Improving,
  author = {L. C. Aparicio and J. Segarra and C. Rodríguez and V. Viñals},
  title = {Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems},
  journal = {Journal of Systems Architecture},
  publisher = {Elsevier},
  year = {2011},
  volume = {57},
  number = {7},
  pages = {695--706},
  doi = {10.1016/j.sysarc.2010.08.008}
}
A. Bosque, V. Viñals, P. Ibáñez and J. Llabería (2011), "Filtering Directory Lookups in CMPs with Write-Through Caches", In Euro-Par 2011 Parallel Processing - 17th International Conference. LNCS 6852., September, 2011. Vol. 6852/2011, pp. 269-281. Springer.
Abstract: In CMPs, coherence protocols are used to maintain data coherence among the multiple local caches. In this paper, we focus on CMPs using write-through local caches, and a directory-based coherence protocol implemented as a duplicate of the local cache tags. A large fraction of directory lookups is due to stores performed on private data local to the processor performing the store. We propose to add a filter before the directory in order to either reduce the associativity of the lookups or even eliminate those that are unnecessary. When a block from the shared cache has only one copy in the local caches, the filter identifies the processor and allows for reducing the number of comparisons performed in the corresponding directory lookup. When that is not possible, the filter bits are used to code other situations that can also reduce the number of directory lookups or their associativity. We evaluate the filter in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with local caches and a shared cache. We show that a filter representing 0.7% of the size of the shared cache can avoid, on average, 97% and 93% of all comparisons performed by directory lookups for SPLASH2 and Specweb2005, respectively. Only for SPLASH2, there is a small performance loss of 0.3%. As a result, on average, directory power is reduced 30.8% and 22.4% for SPLASH2 and Specweb2005, respectively.
BibTeX:
@inproceedings{Bosque2011,
  author = {A. Bosque and V. Viñals and P. Ibáñez and J.M. Llabería},
  title = {Filtering Directory Lookups in CMPs with Write-Through Caches},
  booktitle = {Euro-Par 2011 Parallel Processing - 17th International Conference. LNCS 6852},
  publisher = {Springer},
  year = {2011},
  volume = {6852/2011},
  pages = {269-281}
}
A. Bosque, V. Viñals, P. Ibáñez and J. Llabería (2011), "Filtering Directory Lookups in CMPs", Microprocessors and Microsystems. Design and Verification of Complex Digital Systems. Vol. 35(8), pp. 695-707. Elsevier.
Abstract: Coherence protocols consume an important fraction of power to determine which coherence action to perform. Specifically, on CMPs with shared cache and directory-based coherence protocol implemented as a duplicate of local caches tags, we have observed that a big fraction of directory lookups cause a miss, because the block looked up is not allocated in any local cache. To reduce the number of directory lookups and therefore the power consumption, we propose to add a filter before the directory access. We introduce two filter implementations. In the first one, filtering information is explicitly kept in the shared cache for every block. In the second one, filtering information is decoupled from the shared cache organization, so the filter size does not depend on the shared cache size. We evaluate our filters in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with write-through local caches and a shared cache. We show that, for SPLASH2 benchmarks, the proposed filters reduce the number of directory lookups performed by 60% while power consumption is reduced by 28%. For Specweb2005, the number of directory lookups performed is reduced by 68% (44%), while directory power consumption is reduced by 19% (9%) using the first (second) filter implementation.
BibTeX:
@article{Bosque2011b,
  author = {A. Bosque and V. Viñals and P. Ibáñez and J.M. Llabería},
  title = {Filtering Directory Lookups in CMPs},
  journal = {Microprocessors and Microsystems. Design and Verification of Complex Digital Systems},
  publisher = {Elsevier},
  year = {2011},
  volume = {35},
  number = {8},
  pages = {695--707},
  doi = {10.1016/j.micpro.2011.08.006}
}
M.J. Ramón, E.L. Pueyo, J.L. Briz, A. Pocoví and J.C. Ciria (2011), "Flexural unfolding of horizons using paleomagnetic vectors", Journal of Structural Geology. Vol. 35, pp. 28-39. Elsevier.
Abstract: This paper introduces a new restoration method (Pmag3Drest) designed for complex folded structures (non-cylindrical, non-coaxial). It combines paleomagnetic vectors and bedding markers setting up a reference system that allows deformed and undeformed surfaces to be related to one another. We assume flexural conditions during the deformation. Consequently, the stratigraphic horizons are considered to be globally developable surfaces with total area preservation except in specific deformation areas. Using paleomagnetism in the proposed restoration process (Pmag3Drest) helps to locate these areas with greater accuracy. It is similar to other approaches based on triangulations, but it forces the available paleomagnetic data to converge with the paleomagnetic reference vector during the restoration process. Our experiments use computer and analog models in which the deformed and undeformed surfaces are perfectly known. This enables us to apply the restoration method to the deformed surface and compare the parameters of the restored surface with those of the initial undeformed surface to quantify the quality of the method. Paleomagnetic data anchor the surface leading to more accurate results.
BibTeX:
@article{Briz2011Flexural,
  author = {Mª José Ramón and Emilio L. Pueyo and José Luis Briz and Andrés Pocoví and José Carlos Ciria},
  title = {Flexural unfolding of horizons using paleomagnetic vectors},
  journal = {Journal of Structural Geology},
  publisher = {Elsevier},
  year = {2011},
  volume = {35},
  pages = {28--39},
  doi = {10.1016/j.jsg.2011.11.015}
}
L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2011), "Multi-level Adaptive Prefetching based on Performance Gradient Tracking", The Journal of Instruction-Level Parallelism., January, 2011. Vol. 13, pp. 1-14.
Abstract: We introduce a multi-level prefetching framework with three setups, respectively aimed to minimize cost (Mincost), minimize losses in individual applications (Minloss) or maximize performance with moderate cost (Maxperf). Performance is boosted in all cases by a sequential tagged prefetcher in the L1 cache, with an effective static degree policy. In both cache levels (L1 and L2), we also apply prefetch filters. In the L2 cache we use a novel adaptive policy that selects the best prefetching degree within a fixed set of values, by tracking the performance gradient. Mincost resorts to sequential tagged prefetching in the L2 cache as well. Minloss relies on an accurate, home-made, correlating prefetcher (PDFCM, Differential Finite Context Method Prefetcher). Maxperf maximizes performance at the expense of slight performance losses in a small number of benchmarks, by integrating a sequential tagged prefetcher with PDFCM in the L2 cache.
BibTeX:
@article{Ramos2011,
  author = {L. M. Ramos and J. L. Briz and P. E. Ibáñez and Victor Viñals},
  title = {Multi-level Adaptive Prefetching based on Performance Gradient Tracking},
  journal = {The Journal of Instruction-Level Parallelism},
  year = {2011},
  volume = {13},
  pages = {1-14},
  url = {www.jilp.org/vol13}
}
C. González, J. Resano, A. Plaza and D. Mozos (2011), "FPGA Implementation of Endmember Extraction Algorithms from Hyperspectral Imagery: Pixel Purity Index versus N-FINDR", In Proc. of SPIE High-Performance Computing in Remote Sensing. Prague, Czech Republic
BibTeX:
@inproceedings{Resano2011Endmember,
  author = {Carlos González and Javier Resano and Antonio Plaza and Daniel Mozos},
  title = {FPGA Implementation of Endmember Extraction Algorithms from Hyperspectral Imagery: Pixel Purity Index versus N-FINDR},
  booktitle = {Proc. of SPIE High-Performance Computing in Remote Sensing},
  year = {2011}
}
J.A. Clemente, J. Resano and D. Mozos (2011), "A Replacement Technique to Maximize Task Reuse in Reconfigurable Systems", In IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW). Anchorage, USA
BibTeX:
@inproceedings{Resano2011Replacement,
  author = {Juan Antonio Clemente and Javier Resano and Daniel Mozos},
  title = {A Replacement Technique to Maximize Task Reuse in Reconfigurable Systems},
  booktitle = {IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW)},
  year = {2011}
}
D. Suárez, G. Dimitrakopoulos, T. Monreal, M.G.H. Katevenis and V. Viñals (2011), "LP-NUCA: Networks-in-Cache for High-Performance Low-Power Embedded Processors", IEEE Transactions on Very Large Scale Integration (VLSI) systems.
Abstract: High-end embedded processors demand complex on-chip cache hierarchies satisfying several contradicting design requirements such as high-performance operation and low energy consumption. This paper introduces light-power (LP) nonuniform cache architecture (NUCA), a tiled-cache addressing both goals. LP-NUCA places a group of small and low-latency tiles between the L1 and the last level cache (LLC) that adapt better to the application working sets and keep most recently evicted blocks close to L1. LP-NUCA is built around three specialized "networks-in-cache," each aimed at a separate cache operation. To prove the design feasibility, we have fully implemented LP-NUCA in a 90-nm technology. From the VLSI implementation, we observe that the proposed networks-in-cache incur minimal area, latency, and power overhead. To further reduce the energy consumption, LP-NUCA employs two network-wide techniques (miss wave stopping and sectoring) that together reduce the dynamic cache energy by 35% without degrading performance. Our evaluations also show that LP-NUCA improves performance with respect to cache hierarchies similar to those found in high-end embedded processors. Similar results have been obtained after scaling to a 32-nm technology.
BibTeX:
@article{Suarez2011,
  author = {Darío Suárez and Giorgos Dimitrakopoulos and Teresa Monreal and Manolis G. H. Katevenis and Víctor Viñals},
  title = {LP-NUCA: Networks-in-Cache for High-Performance Low-Power Embedded Processors},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  year = {2011}
}
J. Agustí, I. Pellejero, G. Abadal, G. Murillo, M. Urbiztondo, J. Sesé, M. Villarroya-Gaudo, M. Pina, J. Santamaría and N. Barniol (2010), "Optical vibrometer for mechanical properties characterization of silicalite-only cantilever based sensors", Microelectronic Engineering. Vol. 87(5-8), pp. 1207 - 1209.
BibTeX:
@article{Agustí20101207,
  author = {J. Agustí and I. Pellejero and G. Abadal and G. Murillo and M.A. Urbiztondo and J. Sesé and M. Villarroya-Gaudo and M. Pina and J. Santamaría and N. Barniol},
  title = {Optical vibrometer for mechanical properties characterization of silicalite-only cantilever based sensors},
  journal = {Microelectronic Engineering},
  year = {2010},
  volume = {87},
  number = {5-8},
  pages = {1207--1209},
  note = {The 35th International Conference on Micro- and Nano-Engineering (MNE)},
  url = {http://www.sciencedirect.com/science/article/pii/S016793170900851X},
  doi = {10.1016/j.mee.2009.12.009}
}
L.C. Aparicio, J. Segarra, C. Rodríguez and V. Viñals (2010), "Combining prefetch with instruction cache locking in multitasking real-time systems", In Proceedings of the IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications. Macau SAR, China, August, 2010, pp. 319-328. IEEE Computer Society Press.
Abstract: In multitasking real-time systems it is required to compute the WCET of each task and also the effects of interferences between tasks in the worst case. This is complex with variable latency hardware usually found in the fetch path of commercial processors. Some methods disable cache replacement so that it is easier to model the cache behavior. Lock-MS is an ILP based method to obtain the best selection of memory lines to be locked in a dynamic locking instruction cache. In this paper we first propose a simple memory architecture implementing the next-line tagged prefetch, specially designed for hard real-time systems. Then, we extend Lock-MS to add support for hardware instruction prefetch. Our results show that the WCET of a system with prefetch and an instruction cache with size 5% of the total code size is better than that of a system having no prefetch and cache size 80% of the code. We also evaluate the effects of the prefetch penalty on the resulting WCET, showing that a system without prefetch penalties has a worst-case performance 95% of the ideal case. This highlights the importance of a good prefetch design. Finally, the computation time of our analysis method is relatively short, analyzing tasks of 96 KB with $10^{65}$ paths in less than 3 minutes.
BibTeX:
@inproceedings{Aparicio10Combining,
  author = {L. C. Aparicio and J. Segarra and C. Rodríguez and V. Viñals},
  title = {Combining prefetch with instruction cache locking in multitasking real-time systems},
  booktitle = {Proceedings of the IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications},
  publisher = {IEEE Computer Society Press},
  year = {2010},
  pages = {319--328},
  doi = {10.1109/RTCSA.2010.8}
}
A. Bosque, V. Viñals, P. Ibañez and J. Llaberia (2010), "Filtering Directory Lookups in CMPs", In Proc. 13th Euromicro Conf. Digital System Design: Architectures, Methods and Tools (DSD), pp. 207-216.
Abstract: Coherence protocols consume an important fraction of power to determine which coherence action should take place. In this paper we focus on CMPs with a shared cache and a directory-based coherence protocol implemented as a duplicate of local caches tags. We observe that a big fraction of directory lookups produce a miss since the block looked up is not cached in any local cache. We propose to add a filter before the directory lookup in order to reduce the number of lookups to this structure. The filter identifies whether the current block was last accessed as a data or as an instruction. With this information, looking up the whole directory can be avoided for most accesses. We evaluate the filter in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with a shared L2 cache. We show that a filter with a size of 3% of the tag array of the shared cache can avoid more than 70% of all comparisons performed by directory lookups with a performance loss of just 0.2% for SPLASH2 and 1.5% for Specweb2005. On average, the number of 15-bit comparisons avoided per cycle is 54 out of 77 for SPLASH2 and 29 out of 41 for Specweb2005. In both cases, the filter requires less than one read of 1 bit per cycle.
BibTeX:
@inproceedings{Bosque2010,
  author = {A. Bosque and V. Viñals and P. Ibañez and J.M. Llaberia},
  title = {Filtering Directory Lookups in CMPs},
  booktitle = {Proc. 13th Euromicro Conf. Digital System Design: Architectures, Methods and Tools (DSD)},
  year = {2010},
  pages = {207--216},
  doi = {10.1109/DSD.2010.85}
}
P. Molina-Gaudo, S. Baldassarri, M. Villarroya-Gaudo and E. Cerezo (2010), "Perception and Intention in Relation to Engineering: A Gendered Study Based on a One-Day Outreach Activity", IEEE Transactions on Education. Vol. 53(1), pp. 61-70.
Abstract: This paper explores both how male and female high school pupils (15-16 years old) perceive the engineering profession and their willingness to pursue a career in this area. A study was performed around a one-day outreach activity, Girls' Day, organized for the first time in Spain. During Girls' Day, students were exposed to specific activities developed for them in engineering research labs and companies, carried out by young female researchers and professionals. The study, based on two questionnaires answered before and after the activity, focuses on the differences between groups of female and male students having differing degrees of interest in studying engineering. The educational level of mothers, the presence of engineers in families, and perceived family support emerged as important factors influencing the probability of a young person's considering pursuing engineering studies. Nevertheless, the need to expose children to outreach activities at a younger age and to involve the students' families and teachers has become clear. If planned properly and thoughtfully, even a single day's experience can contribute to changing the perception of what an engineer is.
BibTeX:
@article{Molina-Gaudo2010,
  author = {Molina-Gaudo, P. and Baldassarri, S. and Villarroya-Gaudo, M. and Cerezo, E.},
  title = {Perception and Intention in Relation to Engineering: A Gendered Study Based on a One-Day Outreach Activity},
  journal = {IEEE Transactions on Education},
  year = {2010},
  volume = {53},
  number = {1},
  pages = {61--70},
  doi = {10.1109/TE.2009.2023910}
}
J. Olivito, C. González and J. Resano (2010), "FPGA implementation of a strong Reversi player", In International Conference on Field-Programmable Technology. Beijing, China, December, 2010. ISBN 978-1-4244-8980-0, pp. 507-510.
BibTeX:
@inproceedings{Olivito10Reversi,
  author = {Olivito, J. and Carlos González and Javier Resano},
  title = {FPGA implementation of a strong Reversi player},
  booktitle = {International Conference on Field-Programmable Technology},
  year = {2010},
  isbn = {978-1-4244-8980-0},
  pages = {507--510}
}
C. Gonzalez, J. Resano, D. Mozos, A. Plaza and D. Valencia (2010), "FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis", EURASIP Journal on Advances in Signal Processing. Vol. 2010, pp. 1-13. HINDAWI.
BibTeX:
@article{Resano10c,
  author = {Carlos Gonzalez and Javier Resano and Daniel Mozos and Antonio Plaza and David Valencia},
  title = {FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis},
  journal = {EURASIP Journal on Advances in Signal Processing},
  publisher = {HINDAWI},
  year = {2010},
  volume = {2010},
  pages = {1--13}
}
J. Clemente, C. González, J. Resano and D. Mozos (2010), "A task graph execution manager for reconfigurable multi-tasking systems", Microprocessors and Microsystems. Vol. 34(2-4), pp. 73-83. Elsevier.
BibTeX:
@article{Resano2010a,
  author = {J.A. Clemente and C. González and J. Resano and D. Mozos},
  title = {A task graph execution manager for reconfigurable multi-tasking systems},
  journal = {Microprocessors and Microsystems},
  publisher = {Elsevier},
  year = {2010},
  volume = {34},
  number = {2-4},
  pages = {73--83}
}
J.A. Clemente, J. Resano, C. Gonzalez and D. Mozos (2010), "A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems", IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Vol. 19(7), pp. 1263-1276. IEEE.
BibTeX:
@article{Resano2010b,
  author = {Juan Antonio Clemente and Javier Resano and Carlos Gonzalez and Daniel Mozos},
  title = {A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  publisher = {IEEE},
  year = {2010},
  volume = {19},
  number = {7},
  pages = {1263--1276}
}
C. González, J. Resano, A. Plaza and D. Mozos (2010), "FPGA for computing the pixel purity index algorithm on hyperspectral images", In The 2010 International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA). Las Vegas, USA
BibTeX:
@inproceedings{Resano2010ComputingPixelPurity,
  author = {Carlos González and Javier Resano and Antonio Plaza and Daniel Mozos},
  title = {FPGA for computing the pixel purity index algorithm on hyperspectral images},
  booktitle = {The 2010 International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA)},
  year = {2010}
}
M.A. Montañés, E. Torres, J. Martinez and J.E. Herrero (2010), "Scalability of Color-Based Segmentation of Football Players over GPUs", In 2010 International Workshop on GPUs and Scientific Applications (GPUScA 2010) in conjunction with PACT 2010, pp. 27-35. TR of the Department of Scientific Computing, University of Vienna, TR-10-3.
Abstract: In this paper, we study the scalability of a real application to the available number of cores in the GPU. Our application is a real-time image processing in which a football player feature extractor based in color patterns obtain feasible measures for tracking system. Since football players are composed for diverse and complex color patterns, a Gaussian Mixture Models (GMM) is applied as segmentation paradigm. Optimization techniques have also been applied over the C++ implementation using profiling tools focused on high performance. Time consuming tasks were implemented over NVIDIA's CUDA platform, and later restructured and enhanced, speeding up the whole process significantly. Our resulting code is around 4-11 times faster on a low cost GPU than a highly optimized C++ version on a central processing unit (CPU) over the same data. The optimized application has been benchmarked over different GPUs with different number of cores. Due to data dependencies performance increase 1.4x when doubling number of cores.
BibTeX:
@inproceedings{Torres2010,
  author = {M. A. Montañés and E. Torres and J. Martinez and J. E. Herrero},
  title = {Scalability of Color-Based Segmentation of Football Players over GPUs},
  booktitle = {2010 International Workshop on GPUs and Scientific Applications (GPUScA 2010) in conjunction with PACT 2010},
  publisher = {TR of the Department of Scientific Computing, University of Vienna, TR-10-3},
  year = {2010},
  pages = {27--35}
}
C. González, J. Olivito and J. Resano (2009), "An initial specific processor for Sudoku solving", In International Conference on Field-Programmable Technology. Sydney, Australia, December, 2009. ISBN 978-1-4244-4375-8, pp. 530-533.
BibTeX:
@inproceedings{Gonzalez09Sudokus,
  author = {Carlos González and Javier Olivito and Javier Resano},
  title = {An initial specific processor for Sudoku solving},
  booktitle = {International Conference on Field-Programmable Technology},
  year = {2009},
  isbn = {978-1-4244-4375-8},
  pages = {530--533}
}
R. Gran, E. Morancho, A. Olive and J.M. Llabería (2009), "On reducing misspeculations in a pipelined scheduler", In Proc. IEEE Int. Symp. Parallel & Distributed Processing (IPDPS 2009), pp. 1-12.
Abstract: Pipelining the scheduling logic, which exposes and exploits the instruction level parallelism, degrades processor performance. In a 4-issue processor, our evaluations show that pipelining the scheduling logic over two cycles degrades performance by 10% in SPEC-2000 integer benchmarks. Such a performance degradation is due to sacrificing the ability to execute dependent instructions in consecutive cycles. Speculative selection is a previously proposed technique that boosts the performance of a processor with a pipelined scheduling logic. However, this new speculation source increases the overall number of misspeculated instructions, and this unuseful work wastes energy. In this work we introduce a non-speculative mechanism named Dependence Level Scheduler (DLS) which not only tolerates the scheduling-logic latency but also reduces the number of misspeculated instructions with respect to a scheduler with speculative selection. In DLS, the selection of a group of one-cycle instructions (producer-level) is overlapped with the wake up in advance of its group of dependent instructions. DLS is not speculative because the group of woken in advance instructions will compete for selection only after issuing all producer-level instructions. On average, DLS reduces the number of misspeculated instructions with respect to a speculative scheduler by 17.9%. From the IPC point of view, the speculative scheduler outperforms DLS by 0.3%. Moreover, we propose two non-speculative improvements to DLS.
BibTeX:
@inproceedings{Gran2009,
  author = {Gran, R. and Morancho, E. and Olive, A. and Llabería, J. M.},
  title = {On reducing misspeculations in a pipelined scheduler},
  booktitle = {Proc. IEEE Int. Symp. Parallel \& Distributed Processing (IPDPS 2009)},
  year = {2009},
  pages = {1--12},
  doi = {10.1109/IPDPS.2009.5160990}
}
S. Gutiérrez, O. Benedí, D. Suárez, J. Marín and V. Viñals (2009), "Processor Energy and Temperature in Computer Architecture Courses: a hands-on approach", In Workshop on Computer Architecture Education Held in conjunction with the 42nd Annual International Symposium on Microarchitecture, New York, United States, December 13, 2009.
Abstract: Performance has driven the microprocessor industry for more than thirty years. Its effort has enabled to multiply by several orders of magnitude the computational power; e.g., the Intel 8080 was able to execute 0.64 MIPS and the newest Core i7 can execute 6400 MIPS. The cost of this fabulous improvement has been a large rise in energy consumption. Nowadays, we have reached a point where one of the most limiting factor for improving performance is energy dissipation. In order to keep the performance improvement during the next years, it is necessary to study energy and temperature in deep. Nevertheless, most current computer architecture curricula include neither energy nor temperature. The lack of adequate experimental platforms contributes to the difficulty in teaching these topics. In this paper we propose a possible solution: to instrument a commodity PC for measuring the processor power and temperature during the execution of real programs. The platform is devised for teaching, but it can be used to support research experiments as well. For example, we describe an interesting undergraduate laboratory that analyzes the interaction between compiler optimizations and energy. With this laboratory, students can learn that performance optimizations usually reduce energy but may increase power.
BibTeX:
@inproceedings{Gutierrez2009,
  author = {S. Gutiérrez and O. Benedí and D. Suárez and J.M. Marín and V. Viñals},
  title = {Processor Energy and Temperature in Computer Architecture Courses: a hands-on approach},
  booktitle = {Workshop on Computer Architecture Education Held in conjunction with the 42nd Annual International Symposium on Microarchitecture, New York, United States, December 13, 2009},
  year = {2009}
}
A. Muzahid, D. Suárez, S. Qi and J. Torrellas (2009), "SigRace: signature-based data race detection", In ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture. New York, NY, USA, pp. 337-348. ACM.
Abstract: Detecting data races in parallel programs is important for both software development and production-run diagnosis. Recently, there have been several proposals for hardware-assisted data race detection. Such proposals typically modify the L1 cache and cache coherence protocol messages, and largely lose their capability when lines get displaced or invalidated from the cache. To eliminate these shortcomings, this paper proposes a novel, different approach to hardware-assisted data race detection. The approach, called SigRace, relies on hardware address signatures. As a processor runs, the addresses of the data that it accesses are automatically encoded in signatures. At certain times, the signatures are automatically passed to a hardware module that intersects them with those of other processors. If the intersection is not null, a data race may have occurred. This paper presents the architecture of SigRace, an implementation, and its software interface. With SigRace, caches and coherence protocol messages are unmodified. Moreover, cache lines can be displaced and invalidated with no effect. Our experiments show that SigRace is significantly more effective than a state-of-the-art conventional hardware-assisted race detector. SigRace finds on average 29% more static races and 107% more dynamic races. Moreover, if we inject data races, SigRace finds 150% more static races than the conventional scheme.
BibTeX:
@inproceedings{Muzahid2009,
  author = {Muzahid, Abdullah and Suárez, Dario and Qi, Shanxiang and Torrellas, Josep},
  title = {SigRace: signature-based data race detection},
  booktitle = {ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture},
  publisher = {ACM},
  year = {2009},
  pages = {337--348},
  doi = {10.1145/1555754.1555797}
}
L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2009), "Multi-level Adaptive Prefetching based on Performance Gradient Tracking", 1st. Data Prefetching Championship. Raleigh, North Carolina, February, 2009.
BibTeX:
@inproceedings{Ramos2009,
  author = {Ramos, Luis M. and Briz, José Luis and Ibáñez, Pablo E. and Viñals, Victor},
  title = {Multi-level Adaptive Prefetching based on Performance Gradient Tracking},
  year = {2009},
  note = {Held in conjunction with the 15th International Symposium on High-Performance Computer Architecture (HPCA-15). Best Paper Award},
  url = {http://www.jilp.org/dpc/online/DPC-1%20Program.htm}
}
C. González, J. Resano and D. Mozos (2009), "FPGA Support for Satellite Computations of Hyper Spectral Images", In 19th International Conference on Field Programmable Logic and Applications (FPL). Prague, Czech Republic
BibTeX:
@inproceedings{Resano2009SupportSatellite,
  author = {Carlos González and Javier Resano and Daniel Mozos},
  title = {FPGA Support for Satellite Computations of Hyper Spectral Images},
  booktitle = {19th International Conference on Field Programmable Logic and Applications (FPL)},
  year = {2009}
}
B. Sahelices, P. Ibáñez, V. Viñals and J. Llabería (2009), "A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors", In Euro-Par 2009 Parallel Processing. 15th International Euro-Par Conference, Delft, The Netherlands, August 25--28, 2009. Vol. LNCS 5704, pp. 149-161. Springer Berlin / Heidelberg.
BibTeX:
@inproceedings{Sahelices2009,
  author = {Sahelices, B. and Ibáñez, P. and Viñals, V. and Llabería, J.M.},
  title = {A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors},
  booktitle = {Euro-Par 2009 Parallel Processing. 15th International Euro-Par Conference, Delft, The Netherlands, August 25--28, 2009},
  publisher = {Springer Berlin / Heidelberg},
  year = {2009},
  series = {Lecture Notes in Computer Science},
  volume = {5704},
  pages = {149--161}
}
D. Suárez, T. Monreal, F. Vallejo, R. Beivide and V. Viñals (2009), "Light NUCA: A proposal for bridging the inter-cache latency gap", In Proc. Design, Automation & Test in Europe Conference & Exhibition (DATE '09), pp. 530-535.
Abstract: To deal with the “memory wall” problem, microprocessors include large secondary on-chip caches. But as these caches enlarge, they originate a new latency gap between them and fast L1 caches (inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches that is threatened by wire-delay problems. NUCAs are size-oriented, and they were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs) leveraging on-chip wire density to interconnect small tiles through specialized networks, which convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer granularity than conventional caches, reducing cache latency. Our evaluations show that in general, an L-NUCA improves simultaneously performance, energy, and area when integrated into both conventional or D-NUCA hierarchies.
BibTeX:
@inproceedings{Suarez2009,
  author = {Darío Suárez and Teresa Monreal and Fernando Vallejo and Ramon Beivide and Víctor Viñals},
  title = {Light NUCA: A proposal for bridging the inter-cache latency gap},
  booktitle = {Proc. Design, Automation \& Test in Europe Conference \& Exhibition (DATE '09)},
  year = {2009},
  pages = {530--535},
  url = {http://www.date-conference.com/archive/conference/proceedings/PAPERS/2009/DATE09/PDFFILES/05.7_1.PDF}
}
E. Torres, P. Ibañez, V. Viñals and J.M. Llaberia (2009), "Store Buffer Design for Multibanked Data Caches", IEEE Transactions on Computers. Vol. 58(10), pp. 1307-1320.
Abstract: This paper focuses on how to design a store buffer (STB) well suited to first-level multibanked data caches. The goal is to forward data from in-flight stores into dependent loads within the latency of a cache bank. Taking into account the store lifetime in the processor pipeline and the data forwarding behavior, we propose a particular two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, whereas a centralized, second-level STB enforces correct store-load ordering. Besides, the two-level STB admits two simplifications that leave performance almost unchanged. Regarding the second-level STB, we suggest to remove its data forwarding capability, while for the first-level STB, it is possible to: 1) remove the instruction age checking and 2) compare only the less significant address bits. Experimentation covers both integer and floating point codes executing in dynamically scheduled processors. Following our guidelines and running SPEC-2K over an 8-way processor, a two-level STB with four 8-entry banks in the first level performs similar to an ideal, single-level STB with 128-entry banks working at the first-level cache latency. Also, we show that the proposed two-level design is suitable for a memory-latency-tolerant processor.
BibTeX:
@article{Torres2009,
  author = {Torres, E. and Ibañez, P. and Viñals, V. and Llaberia, J. M.},
  title = {Store Buffer Design for Multibanked Data Caches},
  journal = {IEEE Transactions on Computers},
  year = {2009},
  volume = {58},
  number = {10},
  pages = {1307--1320},
  doi = {10.1109/TC.2009.57}
}
M. Urbiztondo, I. Pellejero, M. Villarroya-Gaudo, J. Sesé, M. Pina, I. Dufour and J. Santamaría (2009), "Zeolite-modified cantilevers for the sensing of nitrotoluene vapors", Sensors and Actuators B: Chemical. Vol. 137(2), pp. 608 - 616.
BibTeX:
@article{Urbiztondo2009608,
  author = {M.A. Urbiztondo and I. Pellejero and M. Villarroya-Gaudo and J. Sesé and M.P. Pina and I. Dufour and J. Santamaría},
  title = {Zeolite-modified cantilevers for the sensing of nitrotoluene vapors},
  journal = {Sensors and Actuators B: Chemical},
  year = {2009},
  volume = {137},
  number = {2},
  pages = {608--616},
  url = {http://www.sciencedirect.com/science/article/pii/S0925400509000549},
  doi = {10.1016/j.snb.2009.01.047}
}
J. Alastruey, T. Monreal, F. Cazorla, V. Viñals and M. Valero (2008), "Selection of the Register File Size and the Resource Allocation Policy on SMT Processors", In Proc. 20th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD '08), October 29 - November 1, 2008, pp. 63-70.
Abstract: The performance impact of the Physical Register File (PRF) size on Simultaneous Multithreading processors has not been extensively studied in spite of being a critical shared resource. In this paper we analyze the effect on performance of the PRF size for a broad set of resource allocation policies (Icount, Stall, Flush, Flush++, Static, Dcra and Hill-climbing) and evaluate them under two metrics: instructions per second (IPS) for throughput and harmonic mean of weighted IPCs (Hmean-wIPC) for fairness. We have found that resource allocation policy and PRF size should be considered together in order to obtain the best score in the proposed metrics. For instance, for the analyzed 2 and 4-threaded SPEC CPU2000 workloads, small PRFs are best managed by Flush, whereas for larger PRFs, Hill-climbing and Static lead to the best values for the throughput and fairness metrics, respectively. The second contribution of this work is a simple procedure that, for a given resource allocation policy, selects the PRF size that maximizes IPS and obtains for Hmean-wIPC a value close to its maximum. According to our results, Hill-climbing with a 320-entry PRF achieves the best figures for 2-threaded workloads. When executing 4-threaded workloads, Hill-Climbing with a 384-entry PRF achieves the best throughput whereas Static obtains the best throughput-fairness balance.
BibTeX:
@inproceedings{Alastruey2008,
  author = {Alastruey, J. and Monreal, T. and Cazorla, F. and Viñals, V. and Valero, M.},
  editor = {IEEE Computer Society},
  title = {Selection of the Register File Size and the Resource Allocation Policy on SMT Processors},
  booktitle = {Proc. 20th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD '08)},
  year = {2008},
  pages = {63--70},
  doi = {10.1109/SBAC-PAD.2008.17}
}
L.C. Aparicio, J. Segarra, C. Rodríguez, J.L. Villarroel and V. Viñals (2008), "Avoiding the WCET Overestimation on LRU Instruction Cache", In Proc. 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 08). Kaohsiung, Taiwan, August, 2008, pp. 393-398. IEEE Computer Society Press.
Abstract: The WCET computation is one of the main challenges in hard real-time systems, since all further analysis is based on this value. The complexity of this problem leads existing analysis methods to compute WCET bounds instead of the exact WCET. In this work we propose a technique to compute the exact instruction fetch contribution to the WCET (IFC-WCET) in presence of a LRU instruction cache. We prove that an exact computation does not need to analyze the full exponential number of possible execution paths, but only a bounded subset of them. In the benchmark codes we have studied, the IFC-WCET is up to 62% lower than a bound computed with a widely used approach, and the difference between the number of possible execution paths and the ones relevant for the analysis is extremely large.
BibTeX:
@inproceedings{Aparicio08Avoiding,
  author = {Aparicio, L. C. and Segarra, J. and Rodríguez, C. and Villarroel, J. L. and Viñals, V.},
  title = {Avoiding the WCET Overestimation on LRU Instruction Cache},
  booktitle = {Proc. 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 08)},
  publisher = {IEEE Computer Society Press},
  year = {2008},
  pages = {393--398},
  doi = {10.1109/RTCSA.2008.10}
}
I. Pellejero, M. Urbiztondo, M. Villarroya-Gaudo, J. Sesé, M. Pina and J. Santamaría (2008), "Development of etching processes for the micropatterning of silicalite films", Microporous and Mesoporous Materials. Vol. 114(1-3), pp. 110-120.
BibTeX:
@article{Pellejero2008110,
  author = {I. Pellejero and M. Urbiztondo and Villarroya-Gaudo, M. and J. Sesé and M.P. Pina and J. Santamaría},
  title = {Development of etching processes for the micropatterning of silicalite films},
  journal = {Microporous and Mesoporous Materials},
  year = {2008},
  volume = {114},
  number = {1--3},
  pages = {110--120},
  url = {http://www.sciencedirect.com/science/article/pii/S1387181107007445},
  doi = {10.1016/j.micromeso.2007.12.023}
}
L. Ramos, J. Briz, P. Ibáñez and V. Viñals (2008), "Low-Cost Adaptive Data Prefetching", In Euro-Par 2008 Parallel Processing. Vol. 5168, pp. 327-336. Springer Berlin / Heidelberg.
BibTeX:
@inproceedings{Ramos2008,
  author = {Ramos, Luis and Briz, José and Ibáñez, Pablo and Viñals, Víctor},
  editor = {Luque, Emilio and Margalef, Tomàs and Benítez, Domingo},
  title = {Low-Cost Adaptive Data Prefetching},
  booktitle = {Euro-Par 2008 Parallel Processing},
  publisher = {Springer Berlin / Heidelberg},
  year = {2008},
  volume = {5168},
  pages = {327--336},
  note = {Acceptance rate: 34\%. springerlink:10.1007/978-3-540-85451-7_36},
  url = {http://dx.doi.org/10.1007/978-3-540-85451-7_36}
}
J. Resano, D. Mozos, F. Catthoor, J.A. Clemente and C. González (2008), "Efficiently scheduling run-time reconfigurations", Transactions on Design Automation of Electronic Systems. Vol. 13(4), pp. 58.1-58.12. ACM.
BibTeX:
@article{Resano2008Scheduling,
  author = {Javier Resano and Daniel Mozos and Francky Catthoor and Juan Antonio Clemente and Carlos González},
  title = {Efficiently scheduling run-time reconfigurations},
  journal = {Transactions on Design Automation of Electronic Systems},
  publisher = {ACM},
  year = {2008},
  volume = {13},
  number = {4},
  pages = {58.1--58.12}
}
J.A. Clemente, C. González, J. Resano and D. Mozos (2008), "A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems", In International Conference on ReConFigurable Computing and FPGAs (ReConFig). Cancún, Quintana Roo, Mexico
BibTeX:
@inproceedings{Resano2008TaskGraph,
  author = {Juan Antonio Clemente and Carlos González and Javier Resano and Daniel Mozos},
  title = {A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems},
  booktitle = {International Conference on ReConFigurable Computing and FPGAs (ReConFig)},
  year = {2008}
}
V. Cholvi and J. Segarra (2008), "Analysis and placement of storage capacity in large distributed video servers", Computer Communications. Vol. 31(15), pp. 3604-3612. Elsevier.
Abstract: In this paper, we study how to distribute storage capacity along a hierarchical system with cache-servers located at each node. This system is intended to deliver stored video streams in a video-on-demand way, ensuring that, once started, a transmission will be completed without any delay or quality loss. We use off-line smoothing for videos, dividing them into CBR video parts. Also, our request rates are distributed following a 24 h audience curve. In this system, when a request is received, the server reserves the required bandwidth at the required time slots, trying to serve the video as soon as possible. We perform a detailed analysis by means of simulations of the start-up time delay for some storage distributions. It shows that an adequate storage distribution can increase performance about 25% with respect to a uniform distribution and about 47% with respect to one in which all the storage is attached to the gateway routers that connect the final users. We also analyze bandwidth usage, comparing the behavior of these storage distributions. Finally, we present a method which allows dynamic and transparent video reallocations when their popularity changes.
BibTeX:
@article{Segarra08Analysis,
  author = {Vicent Cholvi and Juan Segarra},
  title = {Analysis and placement of storage capacity in large distributed video servers},
  journal = {Computer Communications},
  publisher = {Elsevier},
  year = {2008},
  volume = {31},
  number = {15},
  pages = {3604--3612},
  url = {http://www.sciencedirect.com/science/article/B6TYP-4STYTYD-6/2/2971affb87cf195fccbfe8c1021e3d6e},
  doi = {10.1016/j.comcom.2008.06.012}
}
J. Alastruey, T. Monreal, V. Viñals and M. Valero (2007), "Microarchitectural Support for Speculative Register Renaming", In Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007). , pp. 1-10.
Abstract: This paper proposes and evaluates a new microarchitecture for out-of-order processors that supports speculative renaming. We call speculative renaming to the speculative omission of physical register allocation along with the speculative early release of physical registers. These renaming policies may cause a register operand not to be kept in the physical register file (PRF). Thus, we add a low-ported auxiliary register file (XRF) located outside the processor core that keeps the values absent in PRF and supplies them at higher latency. To support the location of register operands being either in PRF or XRF, we use virtual registers. We consider omission and release policies directed by hardware prediction. Namely, we use a single last-use predictor that directs both speculative omission and release. We call this mechanism SR-LUP (speculative renaming based on last-use prediction). Two last-use predictor designs of incremental complexity and performance are analyzed. In a 256-ROB, 8-way processor with an 80int+80fp PRF, SR-LUP with an 11-port 256int+256fp XRF, speeds up computations up to 11.5% and 29% for INT and FP SPEC2K benchmarks, respectively. For FP benchmarks, if the PRF limits the clock frequency, a conventionally managed 128int+128fp PRF can be replaced using SR-LUP by a 64int+64fp PRF backed up with a 10-port 224int+224fp XRF, showing 19% IPS gain.
BibTeX:
@inproceedings{Alastruey2007,
  author = {Alastruey, J. and Monreal, T. and Viñals, V. and Valero, M.},
  title = {Microarchitectural Support for Speculative Register Renaming},
  booktitle = {Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007)},
  year = {2007},
  pages = {1--10},
  doi = {10.1109/IPDPS.2007.370237}
}
A. Bosque, P. Ibáñez, V. Viñals, P. Stenström and J.M. Llabería (2007), "Characterization of Apache web server with Specweb2005", In MEDEA '07: Proceedings of the 2007 workshop on MEmory performance. New York, NY, USA , pp. 65-72. ACM.
Abstract: Computer manufacturers offer today multicore with multi-threading capabilities and a broad range of number of cores. An important market today for these multicores is in the server domain. Web servers are a class of servers which are widely used to provide access to files and also as front-ends of more complex services. In this paper the performance of Apache web server is characterized on multicore chips using Specweb2005 as URL request generator. This benchmark provides three workloads in order to characterize different usage environments. We also compare its performance against Surge that simulates a static web page URL request generator. We find that the L2 data miss rate per instruction is below 1.4%, more than the 60% of the misses are classified as cold or capacity misses and the true sharing misses represent between 12% and 38% of all the misses. We observe that though the data miss rate is small, accesses to main memory represent up to 42% of the execution time. By contrast the true sharing misses that could be up to 38% of all the misses, represent a small fraction of time due to the small latency of cache-to-cache transfers inside the chip.
BibTeX:
@inproceedings{Bosque2007,
  author = {Bosque, Ana and Ibáñez, Pablo and Viñals, Víctor and Stenström, Per and Llabería, Jose M.},
  title = {Characterization of Apache web server with Specweb2005},
  booktitle = {MEDEA '07: Proceedings of the 2007 workshop on MEmory performance},
  publisher = {ACM},
  year = {2007},
  pages = {65--72},
  doi = {10.1145/1327171.1327179}
}
M.V. Gaudo, G. Abadal, J. Verd, J. Teva, F. Perez-Murano, E.F. Costa, J. Montserrat, A. Uranga, J. Esteve and N. Barniol (2007), "Time-Resolved Evaporation Rate of Attoliter Glycerine Drops Using On-Chip CMOS Mass Sensors Based on Resonant Silicon Micro Cantilevers", IEEE Transactions on Nanotechnology. Vol. 6(5), pp. 509-512.
Abstract: The time-resolved evaporation rate of small glycerine drops (in the attoliter range) is determined by means of a mass sensor based on a resonant cantilever integrated in a CMOS chip. The cantilever is fabricated on crystalline silicon, using silicon-on-insulator (SOI) substrates for the integration of the CMOS-MEMS. Glycerine drops are deposited at the free end of the cantilever. The high mass sensitivity of the sensor (8 ag/Hz) allows to determine the evaporation rate for glycerine drops smaller than 500 aL, which are found to be below 3.2 aL/s in volume or 4 fg/s in mass.
BibTeX:
@article{Gaudo2007,
  author = {Gaudo, M. V. and Abadal, G. and Verd, J. and Teva, J. and Perez-Murano, F. and Costa, E. F. and Montserrat, J. and Uranga, A. and Esteve, J. and Barniol, N.},
  title = {Time-Resolved Evaporation Rate of Attoliter Glycerine Drops Using On-Chip CMOS Mass Sensors Based on Resonant Silicon Micro Cantilevers},
  journal = {IEEE Transactions on Nanotechnology},
  year = {2007},
  volume = {6},
  number = {5},
  pages = {509--512},
  doi = {10.1109/TNANO.2007.901477}
}
(2007), "XVIII Jornadas de Paralelismo", September, 2007.
BibTeX:
@proceedings{Ibanez07jjpar,
  editor = {Pablo Ibáñez and Enrique Torres and Juan Segarra and Jesús Alastruey and Luis Manuel Ramos},
  title = {XVIII Jornadas de Paralelismo},
  year = {2007}
}
L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2007), "Data prefetching in a cache hierarchy with high bandwidth and capacity", SIGARCH Comput. Archit. News. New York, NY, USA Vol. 35(4), pp. 37-44. ACM.
Abstract: In this paper we evaluate four hardware data prefetchers in the context of a high-performance three-level on chip cache hierarchy with high bandwidth and capacity. We consider two classic prefetchers (Sequential Tagged and Stride) and two correlating prefetchers: PC/DC, a recent method with a superior score and low-sized tables, and P-DFCM, a new method. Like PC/DC, P-DFCM focuses on local delta sequences, but it is based on the DFCM value predictor. We explore different prefetch degrees and distances. Running SPEC2000, Olden and IAbench applications, results show that this kind of cache hierarchy turns prefetching aggressiveness into success for the four prefetchers. Sequential Tagged is the best, and deserves further attention to cut its losses in some applications. PC/DC results are matched or even improved by P-DFCM, using far fewer accesses to tables while keeping sizes low.
BibTeX:
@article{Ramos2007,
  author = {Ramos, Luis M. and Briz, José Luis and Ibáñez, Pablo E. and Viñals, Víctor},
  title = {Data prefetching in a cache hierarchy with high bandwidth and capacity},
  journal = {SIGARCH Comput. Archit. News},
  publisher = {ACM},
  year = {2007},
  volume = {35},
  number = {4},
  pages = {37--44},
  doi = {10.1145/1327312.1327319}
}
J. Segarra and V. Cholvi (2007), "Convergence of periodic broadcasting and video-on-demand", Computer Communications. Vol. 30(5), pp. 1136-1141. Elsevier.
Abstract: Research on video-on-demand transmissions is essentially divided into periodic broadcasting methods and on-demand methods. Periodic broadcasting is aimed to schedule transmissions off-line, so that an optimized time schedule is achieved. On the other hand video-on-demand has to deal with constraints at requesting times. Thus, studies on these areas have been quite isolated. Obviously, in periodic broadcasting all parameters are known in advance, so timetables can be accurately adjusted and it is assumed transmissions can be arranged to use less bandwidth than video-on-demand. In this paper, we analyze the convergence of both paradigms, showing that the claims that argue that VoD schemes use more bandwidth than PB ones are not necessarily true. We state this argument by proving how to convert any periodic broadcasting method into an on-demand one, which will use equal or less bandwidth. Moreover, we show that this converted on-demand method can also offer shorter serving times.
BibTeX:
@article{Segarra07Convergence,
  author = {Juan Segarra and Vicent Cholvi},
  title = {Convergence of periodic broadcasting and video-on-demand},
  journal = {Computer Communications},
  publisher = {Elsevier},
  year = {2007},
  volume = {30},
  number = {5},
  pages = {1136--1141},
  note = {Advances in Computer Communications Networks},
  url = {http://www.sciencedirect.com/science/article/B6TYP-4MR1HD7-1/2/e7ac99bb58b0117cc755a45a520cd7eb},
  doi = {10.1016/j.comcom.2006.12.007}
}
M. Villarroya-Gaudó, N. Barniol, C. Martin, F. Pérez-Murano, J. Esteve, L. Bruchhaus, R. Jede, E. Bourhis and J. Gierak (2007), "Fabrication of nanogaps for MEMS prototyping using focused ion beam as a lithographic tool and reactive ion etching pattern transfer", Microelectronic Engineering. Vol. 84(5-8), pp. 1215-1218.
BibTeX:
@article{Villarroya20071215,
  author = {Maria Villarroya-Gaudó and Nuria Barniol and Cristina Martin and Francesc Pérez-Murano and Jaume Esteve and Lars Bruchhaus and Ralf Jede and Eric Bourhis and Jacques Gierak},
  title = {Fabrication of nanogaps for MEMS prototyping using focused ion beam as a lithographic tool and reactive ion etching pattern transfer},
  journal = {Microelectronic Engineering},
  year = {2007},
  volume = {84},
  number = {5--8},
  pages = {1215--1218},
  note = {Proceedings of the 32nd International Conference on Micro- and Nano-Engineering},
  url = {http://www.sciencedirect.com/science/article/pii/S0167931707001323},
  doi = {10.1016/j.mee.2007.01.074}
}
M. Villarroya-Gaudo, E. Figueras, J. Montserrat, J. Verd, J. Teva, G. Abadal, F. Perez-Murano, J. Esteve and N. Barniol (2006), "A platform for monolithic CMOS-MEMS integration on SOI wafers", Journal of Micromechanics and Microengineering. Vol. 16(10), pp. 2203.
Abstract: A new platform for micro- and nano-electromechanical systems based on crystalline silicon as the structural layer in CMOS substrates is presented. This platform is fabricated using silicon on insulator (SOI) substrates, which allows the monolithic integration of the mechanical transducer on crystalline silicon while the characteristics of the structural layer are kept independent from the CMOS technology. We report the design characteristics, the fabrication process and an example of application of the CMOS SOI-MEMS platform to obtain a mass sensor based on a crystalline silicon resonating cantilever.
BibTeX:
@article{0960-1317-16-10-038,
  author = {Villarroya-Gaudo, M. and Eduard Figueras and Josep Montserrat and Jaume Verd and Jordi Teva and Gabriel Abadal and Francesc Perez-Murano and Jaume Esteve and Nuria Barniol},
  title = {A platform for monolithic CMOS-MEMS integration on SOI wafers},
  journal = {Journal of Micromechanics and Microengineering},
  year = {2006},
  volume = {16},
  number = {10},
  pages = {2203},
  url = {http://stacks.iop.org/0960-1317/16/i=10/a=038}
}
J. Alastruey, T. Monreal, V. Viñals and M. Valero (2006), "Speculative early register release", In Proceedings of the 3rd conference on Computing frontiers (CF '06). New York, NY, USA , pp. 291-302. ACM.
Abstract: The late release policy of conventional renaming keeps many registers in the register file assigned in spite of containing values that will never be read in the future. In this work, we study the potential of a novel scheme that speculatively releases a physical register as soon as it has been read by a predicted last instruction that references its value. An auxiliary register file placed outside the critical paths of the processor pipeline holds the early released values just in case they are unexpectedly referenced by some instruction. In addition to demonstrate the feasibility of a last-use predictor, this paper also analyzes the auxiliary register file (latency and size) required to support a speculative early release mechanism that uses a perfect predictor. The obtained results set the performance bound that any real speculative early release implementation is able to reach. We show that in a processor with a 64int+64fp register file, a perfect early release supported by an unbounded auxiliary register file has the potential of speeding up computations up to 23% and 47% for SPECint2000 and SPECfp2000 benchmarks, respectively. Speculative early release can also be used to reduce register file size without losing performance. For instance, a processor with a conventionally managed 96int+96fp register file could be replaced for equal IPC with a 64int+64fp register file managed with perfect early register release and backed with a 64int+64fp auxiliary register file, this representing a 12% IPS (Instructions Per Second) increase if the processor frequency were constrained by the register file access time.
BibTeX:
@inproceedings{Alastruey2006,
  author = {Alastruey, Jesús and Monreal, Teresa and Viñals, Víctor and Valero, Mateo},
  title = {Speculative early register release},
  booktitle = {Proceedings of the 3rd conference on Computing frontiers (CF '06)},
  publisher = {ACM},
  year = {2006},
  pages = {291--302},
  doi = {10.1145/1128022.1128061}
}
J. Alastruey, J.L. Briz, P. Ibañez and V. Viñals (2006), "Software Demand, Hardware Supply", IEEE MICRO., July, 2006. Vol. 26(4), pp. 72-82.
BibTeX:
@article{Alastruey2006a,
  author = {Alastruey, J. and Briz, J. L. and Ibañez, P. and Viñals, V.},
  title = {Software Demand, Hardware Supply},
  journal = {IEEE MICRO},
  year = {2006},
  volume = {26},
  number = {4},
  pages = {72--82},
  doi = {10.1109/MM.2006.80}
}
R. Gran, E. Morancho, A. Olive and J.M. Llabería (2006), "An Enhancement for a Scheduling Logic Pipelined over two Cycles", In Proc. Int. Conf. Computer Design ICCD 2006. , pp. 203-209.
Abstract: Out of order processors use the dynamic scheduling logic both to expose and to exploit parallelism. Pipelining this logic may sacrifice the ability to execute dependent instructions in consecutive cycles. Several previous studies have shown that pipelining the scheduling logic over two cycles degrades performance; our evaluations, in a 4-way machine, on SPEC-2000 integer benchmarks show a performance degradation about 11% compared to an unpipelined scheduling logic. In this work, we present two non-speculative enhancements for a scheduling logic pipelined over two cycles. The idea is computing in advance which instructions will be woken-up by all instructions that are currently competing for selection. Once all of them have been selected, the pre-computed group of instructions can compete for selection in next cycle. The enhancement goal is to tolerate the scheduling-loop latency when not enough ILP is available through the scheduling of dependent instructions in consecutive cycles. Our results in a 4-way machine show that our two proposed enhancements perform, on average, slightly better than two previously proposed speculative schedulers. The performance of our proposals is within a 2.6% and 2% of an unpipelined ideal scheduler.
BibTeX:
@inproceedings{Gran2006,
  author = {Gran, R. and Morancho, E. and Olive, A. and Llabería, J. M.},
  title = {An Enhancement for a Scheduling Logic Pipelined over two Cycles},
  booktitle = {Proc. Int. Conf. Computer Design ICCD 2006},
  year = {2006},
  pages = {203--209},
  doi = {10.1109/ICCD.2006.4380818}
}
L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2006), "Data prefetching in a cache hierarchy with high bandwidth and capacity", In MEDEA '06: Proceedings of the 2006 workshop on MEmory performance. New York, NY, USA , pp. 37-44. ACM.
Abstract: In this paper we evaluate four hardware data prefetchers in the context of a high-performance three-level on chip cache hierarchy with high bandwidth and capacity. We consider two classic prefetchers (Sequential Tagged and Stride) and two correlating prefetchers: PC/DC, a recent method with a superior score and low-sized tables, and P-DFCM, a new method. Like PC/DC, P-DFCM focuses on local delta sequences, but it is based on the DFCM value predictor. We explore different prefetch degrees and distances. Running SPEC2000, Olden and IAbench applications, results show that this kind of cache hierarchy turns prefetching aggressiveness into success for the four prefetchers. Sequential Tagged is the best, and deserves further attention to cut its losses in some applications. PC/DC results are matched or even improved by P-DFCM, using far fewer accesses to tables while keeping sizes low.
BibTeX:
@inproceedings{Ramos2006,
  author = {Ramos, Luis M. and Briz, José Luis and Ibáñez, Pablo E. and Viñals, Víctor},
  title = {Data prefetching in a cache hierarchy with high bandwidth and capacity},
  booktitle = {MEDEA '06: Proceedings of the 2006 workshop on MEmory performance},
  publisher = {ACM},
  year = {2006},
  pages = {37--44},
  doi = {10.1145/1166133.1166138}
}
B. Sahelices, A.d. Dios, P. Ibáñez, V. Viñals and J. Llabería (2006), "Speeding-up Synchronizations in DSM Multiprocessors", In Proceedings of the 12th International Euro-Par Conference (EUROPAR 2006).
Abstract: Synchronization in parallel programs is a major performance bottleneck. Shared data is protected by locks and a lot of time is spent in the competition arising at the lock hand-off. In this period of time, a large amount of traffic is targeted to the line holding the lock variable. In order to be serialized, the requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper we focus on systems whose coherence controllers buffer requests. During lock hand-off only the requests from the winning processor contribute to the computation progress, because the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism named Request Bypass, which allows requests from the winning processor to bypass the requests buffered in the home coherence controller keeping the lock line. The mechanism does not require compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32 processor system we show that Request Bypass reduces execution time and lock stall time up to 35% and 75%, respectively. The programs limited by synchronization benefit the most from Request Bypass.
BibTeX:
@inproceedings{Sahelices2006,
  author = {Sahelices, B. and Dios, A. de and Ibáñez, P. and Viñals, V. and Llabería, J.M.},
  title = {Speeding-up Synchronizations in DSM Multiprocessors},
  booktitle = {Proceedings of the 12th International Euro-Par Conference (EUROPAR 2006)},
  year = {2006}
}
M. Villarroya-Gaudo, E. Figueras, J. Verd, J. Teva, G. Abadal, F. Perez-Murano, J. Montserrat, A. Uranga, J. Esteve and N. Barniol (2006), "CMOS-SOI platform for monolithic integration of crystalline silicon MEMS", Electronics Letters. Vol. 42(14), pp. 800-801.
Abstract: A new platform for the fabrication of crystalline micro- and nano-electromechanical systems fully integrable with CMOS is presented. A pre-CMOS process on SOI wafers allows bulk silicon areas for standard CMOS processing and areas with a stack layer of silicon and silicon oxide to be obtained, in which a set of microelectromechanical devices can be fabricated. An integrated resonant beam system with electrical actuation and detection fabricated according to the presented approach is provided.
BibTeX:
@article{Villarroya2006,
  author = {Villarroya-Gaudo, M. and Figueras, E. and Verd, J. and Teva, J. and Abadal, G. and Perez-Murano, F. and Montserrat, J. and Uranga, A. and Esteve, J. and Barniol, N.},
  title = {CMOS-SOI platform for monolithic integration of crystalline silicon MEMS},
  journal = {Electronics Letters},
  year = {2006},
  volume = {42},
  number = {14},
  pages = {800--801},
  doi = {10.1049/el:20061097}
}
M. Villarroya-Gaudo, J. Verd, J. Teva, G. Abadal, E. Forsen, F.P. Murano, A. Uranga, E. Figueras, J. Montserrat, J. Esteve, A. Boisen and N. Barniol (2006), "System on chip mass sensor based on polysilicon cantilevers arrays for multiple detection", Sensors and Actuators A: Physical. Vol. 132(1), pp. 154-164.
BibTeX:
@article{Villarroya2006154,
  author = {Villarroya-Gaudo, M. and Jaume Verd and Jordi Teva and Gabriel Abadal and Esko Forsen and Francesc Pérez Murano and Arantxa Uranga and Eduard Figueras and Josep Montserrat and Jaume Esteve and Anja Boisen and Núria Barniol},
  title = {System on chip mass sensor based on polysilicon cantilevers arrays for multiple detection},
  journal = {Sensors and Actuators A: Physical},
  year = {2006},
  volume = {132},
  number = {1},
  pages = {154--164},
  note = {The 19th European Conference on Solid-State Transducers},
  url = {http://www.sciencedirect.com/science/article/pii/S0924424706002780},
  doi = {10.1016/j.sna.2006.04.002}
}
M.J. Garzarán, M. Prvulovic, J.M. Llabería, V. Viñals, L. Rauchwerger and J. Torrellas (2005), "Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors", ACM Trans. Archit. Code Optim.. New York, NY, USA Vol. 2(3), pp. 247-279. ACM.
BibTeX:
@article{Garzaran2005,
  author = {Garzarán, María Jesús and Prvulovic, Milos and Llabería, José María and Viñals, Víctor and Rauchwerger, Lawrence and Torrellas, Josep},
  title = {Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors},
  journal = {ACM Trans. Archit. Code Optim.},
  publisher = {ACM},
  year = {2005},
  volume = {2},
  number = {3},
  pages = {247--279},
  doi = {10.1145/1089008.1089010}
}
E.F. Torres, P. Ibáñez, V. Viñals and J.M. Llaberia (2005), "Store buffer design in first-level multibanked data caches", In Proc. 32nd International Symposium on Computer Architecture (ISCA '05)., June 4-8, 2005. , pp. 469-480.
BibTeX:
@inproceedings{Torres2005,
  author = {Torres, E. F. and Ibáñez, P. and Viñals, V. and Llaberia, J. M.},
  title = {Store buffer design in first-level multibanked data caches},
  booktitle = {Proc. 32nd International Symposium on Computer Architecture (ISCA '05)},
  year = {2005},
  pages = {469--480},
  doi = {10.1109/ISCA.2005.47}
}
J. Verd, G. Abadal, J. Teva, M. Villarroya-Gaudo, A. Uranga, X. Borrise, F. Campabadal, J. Esteve, E.F. Costa, F. Perez-Murano, Z.J. Davis, E. Forsen, A. Boisen and N. Barniol (2005), "Design, fabrication, and characterization of a submicroelectromechanical resonator with monolithically integrated CMOS readout circuit", Journal of Microelectromechanical Systems. Vol. 14(3), pp. 508-519.
Abstract: In this paper, we report on the main aspects of the design, fabrication, and performance of a microelectromechanical system constituted by a mechanical submicrometer scale resonator (cantilever) and the readout circuitry used for monitoring its oscillation through the detection of the capacitive current. The CMOS circuitry is monolithically integrated with the mechanical resonator by a technology that allows the combination of standard CMOS processes and novel nanofabrication methods. The integrated system constitutes an example of a submicroelectromechanical system to be used as a cantilever-based mass sensor with both a high sensitivity and a high spatial resolution (on the order of 10^-18 g and 300 nm, respectively). Experimental results on the electrical characterization of the resonance curve of the cantilever through the integrated CMOS readout circuit are shown.
BibTeX:
@article{Verd2005,
  author = {Verd, J. and Abadal, G. and Teva, J. and Villarroya-Gaudo, M. and Uranga, A. and Borrise, X. and Campabadal, F. and Esteve, J. and Costa, E. F. and Perez-Murano, F. and Davis, Z. J. and Forsen, E. and Boisen, A. and Barniol, N.},
  title = {Design, fabrication, and characterization of a submicroelectromechanical resonator with monolithically integrated CMOS readout circuit},
  journal = {Journal of Microelectromechanical Systems},
  year = {2005},
  volume = {14},
  number = {3},
  pages = {508--519},
  doi = {10.1109/JMEMS.2005.844845}
}
M. Villarroya-Gaudo, J. Verd, J. Teva, G. Abadal, F. Perez, J. Esteve and N. Barniol (2005), "Cantilever based MEMS for multiple mass sensing", In Proc. PhD Research in Microelectronics and Electronics. Vol. 1, pp. 197-200.
Abstract: A cantilever based micro electro mechanical system (MEMS) for mass detection is presented. The sensor for multiple detections is composed by several cantilevers in an array configuration integrated monolithically with CMOS. Cantilevers are excited electrostatically to its resonance frequency. The oscillation of the microcantilever is detected by a capacitive detection technique. Mass variation is detected by resonance frequency shifting. The mechanical transducers are fabricated after CMOS process on polysilicon, one of the CMOS layers. Optical lithography is used for the cantilevers definitions. Cantilevers of 50 μm length, 1.1 μm wide and 600 nm thick have been defined. This sensor provides a mass sensitivity of 70 ag/Hz.
BibTeX:
@inproceedings{Villarroya2005,
  author = {Villarroya-Gaudo, M. and Verd, J. and Teva, J. and Abadal, G. and Perez, F. and Esteve, J. and Barniol, N.},
  title = {Cantilever based MEMS for multiple mass sensing},
  booktitle = {Proc. PhD Research in Microelectronics and Electronics},
  year = {2005},
  volume = {1},
  pages = {197--200},
  doi = {10.1109/RME.2005.1543038}
}
M. Villarroya-Gaudo, J. Verd, J. Teva, G. Abadal, E. Figueras, F. Perez-Murano, J. Esteve and N. Barniol (2005), "Sensor based on arrays of sub-micrometer scale resonant silicon cantilevers integrated monolithically with CMOS circuitry", In Proc. Spanish Conf. Electron Devices. , pp. 603-606.
Abstract: A mass sensor, based on arrays of cantilevers for multiple detection, is presented. Excitation and readout is performed electrostatically. Polysilicon cantilevers are integrated monolithically with CMOS in a compatible process. Integration of arrays of cantilever allows performing multiple detections on the same device. A multiplexing system to select individual cantilevers is implemented as well as a scheme based on two readout circuits for differential measurements. A mass resolution of 5·10^-18 g has been achieved in the first working devices.
BibTeX:
@inproceedings{Villarroya2005a,
  author = {Villarroya-Gaudo, M. and Verd, J. and Teva, J. and Abadal, G. and Figueras, E. and Perez-Murano, F. and Esteve, J. and Barniol, N.},
  title = {Sensor based on arrays of sub-micrometer scale resonant silicon cantilevers integrated monolithically with CMOS circuitry},
  booktitle = {Proc. Spanish Conf. Electron Devices},
  year = {2005},
  pages = {603--606},
  doi = {10.1109/SCED.2005.1504530}
}
M. Villarroya-Gaudo, F. Pérez-Murano, C. Martin, Z. Davis, A. Boisen, J. Esteve, E. Figueras, J. Montserrat and N. Barniol (2004), "AFM lithography for the definition of nanometre scale gaps: application to the fabrication of a cantilever-based sensor with electrochemical current detection", Nanotechnology. Vol. 15(7), pp. 771.
Abstract: The concept, design and fabrication of a cantilever-based sensor operating in liquid for biochemical applications are reported. A novel approach for detecting the deflection of a functionalized cantilever is proposed. It consists of detecting the change of the electrochemical current level when a voltage is applied between a deflecting cantilever, acting as one of the electrodes, and a reference fixed electrode placed in close proximity to the free extreme of the cantilever. The detection is possible since the distance between the two electrodes is smaller than 50 nm. The sensor is fabricated by using a combination of MEMS technology and AFM-based lithography.
BibTeX:
@article{0957-4484-15-7-009,
  author = {Villarroya-Gaudo, M. and Francesc Pérez-Murano and Cristina Martin and Zachary Davis and Anja Boisen and Jaume Esteve and Eduard Figueras and Josep Montserrat and Nuria Barniol},
  title = {AFM lithography for the definition of nanometre scale gaps: application to the fabrication of a cantilever-based sensor with electrochemical current detection},
  journal = {Nanotechnology},
  year = {2004},
  volume = {15},
  number = {7},
  pages = {771},
  url = {http://stacks.iop.org/0957-4484/15/i=7/a=009}
}
T. Monreal, V. Viñals, J. Gonzalez, A. Gonzalez and M. Valero (2004), "Late allocation and early release of physical registers", IEEE Transactions on Computers. Vol. 53(10), pp. 1244-1259.
Abstract: The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.
BibTeX:
@article{Monreal2004,
  author = {Monreal, T. and Viñals, V. and Gonzalez, J. and Gonzalez, A. and Valero, M.},
  title = {Late allocation and early release of physical registers},
  journal = {IEEE Transactions on Computers},
  year = {2004},
  volume = {53},
  number = {10},
  pages = {1244--1259},
  doi = {10.1109/TC.2004.79}
}
E. Torres, P. Ibáñez, V. Viñals and J. Llabería (2004), "Contents Management in First-Level Multibanked Data Caches", In 10th International Euro-Par Conference, LNCS 3149. Vol. 3149/2004, pp. 516-524. Springer Berlin / Heidelberg.
Abstract: High-performance processors will increasingly rely on multibanked first-level caches to meet frequency requirements. In this paper we introduce replication degree and data distribution as the main multibanking design axes. We sample this design space by selecting current data distribution policy proposals, measuring them on a detailed model of a deep pipelined processor and evaluating the trade-off introduced when the replication degree is taken into account. We find that the best design points use data address interleaving policies and several degrees of bank replication.
BibTeX:
@inproceedings{Torres2004,
  author = {E. Torres and P. Ibáñez and V. Viñals and J.M. Llabería},
  editor = {LNCS , Springer Berlin / Heidelberg},
  title = {Contents Management in First-Level Multibanked Data Caches},
  booktitle = {10th International Euro-Par Conference, LNCS 3149},
  publisher = {Springer Berlin / Heidelberg},
  year = {2004},
  volume = {3149/2004},
  pages = {516--524},
  doi = {10.1007/978-3-540-27866-5_68}
}
M.J. Garzaran, M. Prvulovic, J.M. Llaberia, V. Viñals, L. Rauchwerger and J. Torrellas (2003), "Tradeoffs in buffering memory state for thread-level speculation in multiprocessors", In Proc. Ninth Int. Symp. High-Performance Computer Architecture HPCA-9 2003. , pp. 191-202.
Abstract: Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffering and managing multi-version speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for merging the state of tasks with main memory lazily. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for future state in main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.
BibTeX:
@inproceedings{Garzaran2003b,
  author = {Garzaran, M. J. and Prvulovic, M. and Llaberia, J. M. and Viñals, V. and Rauchwerger, L. and Torrellas, J.},
  title = {Tradeoffs in buffering memory state for thread-level speculation in multiprocessors},
  booktitle = {Proc. Ninth Int. Symp. High-Performance Computer Architecture HPCA-9 2003},
  year = {2003},
  pages = {191--202},
  doi = {10.1109/HPCA.2003.1183537}
}
M.J. Garzaran, M. Prvulovic, V. Viñals, J.M. Llaberia, L. Rauchwerger and J. Torrellas (2003), "Using software logging to support multiversion buffering in thread-level speculation", In Proc. 12th Int. Conf. Parallel Architectures and Compilation Techniques PACT 2003. , pp. 170-181.
Abstract: In thread-level speculation (TLS), speculative tasks generate memory state that cannot simply be combined with the rest of the system because it is unsafe. One way to deal with this difficulty is to allow speculative state to merge with memory but back up in an undo log the data that will be overwritten. Such undo log can be used to roll back to a safe state if a violation occurs. This approach is said to use future main memory (FMM), as memory keeps the most speculative state. While the aggressive approach of FMM systems often delivers better performance than more conservative approaches, it also requires additional hardware support. To simplify the design of FMM systems, we propose a software-only design for the undo log system. We show that an FMM system with software logging is a good design point: the design has less implementation complexity than an FMM system with hardware logs, and it only reduces performance moderately. In particular, in a simulated 16-processor machine, applications take only 10% longer to execute than if the system had the logging system fully implemented in hardware.
BibTeX:
@inproceedings{Garzaran2003c,
  author = {Garzaran, M. J. and Prvulovic, M. and Viñals, V. and Llaberia, J. M. and Rauchwerger, L. and Torrellas, J.},
  title = {Using software logging to support multiversion buffering in thread-level speculation},
  booktitle = {Proc. 12th Int. Conf. Parallel Architectures and Compilation Techniques PACT 2003},
  year = {2003},
  pages = {170--181},
  doi = {10.1109/PACT.2003.1238013}
}
E. Torres, P. Ibáñez, V. Viñals and J. Llabería (2003), "Counteracting Bank Misprediction in Sliced First-Level Caches", In 9th International Euro-Par Conference, LNCS 2790. Vol. 2790 Springer Berlin / Heidelberg.
Abstract: Future processors having sliced memory pipelines will rely on bank prediction to schedule memory instructions to a first-level cache split into banks. In a deeply pipelined processor, even a small bank misprediction rate may degrade performance severely. The goal of this paper is to counteract the bank misprediction penalty, so that in spite of such bank misprediction, performance suffers little. Our contribution is twofold: a new recovery scheme for latency misprediction, and two policies for selectively replicating loads to all banks. The proposals have been evaluated for 4 and 8-way superscalar processors and a wide range of pipeline depths. The best combination of our mechanisms improves IPC of an 8-way baseline processor up to 11%, removing up to two thirds of the bank misprediction penalty.
BibTeX:
@inproceedings{Torres2003,
  author = {E. Torres and P. Ibáñez and V. Viñals and J.M. Llabería},
  editor = {LNCS , Springer Berlin / Heidelberg},
  title = {Counteracting Bank Misprediction in Sliced First-Level Caches},
  booktitle = {9th International Euro-Par Conference, LNCS 2790},
  publisher = {Springer Berlin / Heidelberg},
  year = {2003},
  volume = {2790},
  doi = {10.1007/978-3-540-45209-6_83}
}
F. Dang, M.J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, N. Amato, L. Rauchwerger and J. Torrellas (2002), "Smartapps, an application centric approach to high performance computing: compiler-assisted software and hardware support for reduction operations", In Abstracts and CD-ROM Parallel and Distributed Processing Symposium., International, IPDPS 2002. , pp. 172-181.
BibTeX:
@inproceedings{Dang2002,
  author = {Dang, F. and Garzaran, M. J. and Prvulovic, M. and Zhang, Ye and Jula, A. and Yu, Hao and Amato, N. and Rauchwerger, L. and Torrellas, J.},
  title = {Smartapps, an application centric approach to high performance computing: compiler-assisted software and hardware support for reduction operations},
  booktitle = {Abstracts and CD-ROM Parallel and Distributed Processing Symposium., International, IPDPS 2002},
  year = {2002},
  pages = {172--181},
  doi = {10.1109/IPDPS.2002.1016572}
}
T. Monreal, V. Viñals, A. Gonzalez and M. Valero (2002), "Hardware schemes for early register release", In Proc. International Conference on Parallel Processing. , pp. 5-13.
Abstract: Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is quite related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
BibTeX:
@inproceedings{Monreal2002,
  author = {Monreal, T. and Viñals, V. and Gonzalez, A. and Valero, M.},
  title = {Hardware schemes for early register release},
  booktitle = {Proc. International Conference on Parallel Processing},
  year = {2002},
  pages = {5--13},
  doi = {10.1109/ICPP.2002.1040854}
}
M.J. Garzaran, J.L. Briz, P.E. Ibáñez and V. Viñals (2001), "Hardware prefetching in bus-based multiprocessors: pattern characterization and cost-effective hardware", In Proc. Ninth Euromicro Workshop on Parallel and Distributed Processing. , pp. 345-354.
Abstract: Data prefetching has been widely studied as a technique to hide memory access latency in multiprocessors. Most recent research on hardware prefetching focuses either on uniprocessors, or on distributed shared memory (DSM) and other non bus-based organizations. However, in the context of bus-based SMPs, prefetching poses a number of problems related to the lack of scalability and limited bus bandwidth of these modest-sized machines. This paper considers how the number of processors and the memory access patterns in the program influence the relative performance of sequential and non-sequential prefetching mechanisms in a bus-based SMP. We compare the performance of four inexpensive hardware prefetching techniques, varying the number of processors. After a breakdown of the results based on a performance model, we propose a cost-effective hardware prefetching solution for implementing on such modest-sized multiprocessors
BibTeX:
@inproceedings{Garzaran2001,
  author = {Garzaran, M. J. and Briz, J. L. and Ibáñez, P. E. and Viñals, V.},
  title = {Hardware prefetching in bus-based multiprocessors: pattern characterization and cost-effective hardware},
  booktitle = {Proc. Ninth Euromicro Workshop on Parallel and Distributed Processing},
  year = {2001},
  pages = {345--354},
  doi = {10.1109/EMPDP.2001.905061}
}
M.J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, L. Rauchwerger and J. Torrellas (2001), "Architectural support for parallel reductions in scalable shared-memory multiprocessors", In Proc. International Conference on Parallel Architectures and Compilation Techniques. , pp. 243-254.
Abstract: Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6
BibTeX:
@inproceedings{Garzaran2001b,
  author = {Garzaran, M. J. and Prvulovic, M. and Zhang, Ye and Jula, A. and Yu, Hao and Rauchwerger, L. and Torrellas, J.},
  title = {Architectural support for parallel reductions in scalable shared-memory multiprocessors},
  booktitle = {Proc. International Conference on Parallel Architectures and Compilation Techniques},
  year = {2001},
  pages = {243--254},
  doi = {10.1109/PACT.2001.953304}
}
M. Prvulovic, M.J. Garzaran, L. Rauchwerger and J. Torrellas (2001), "Removing architectural bottlenecks to the scalability of speculative parallelization", In Proc. 28th Annual International Symposium on Computer Architecture. , pp. 204-215.
Abstract: Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far show that it is hard to deliver scalable speedups. Often, the problem is not true dependence violations, but sub-optimal architectural design. Consequently, we attempt to identify and eliminate major architectural bottlenecks that limit the scalability of speculative parallelization. The solutions that we propose are: low-complexity commit in constant time to eliminate the task commit bottleneck, a memory-based overflow area to eliminate stall due to speculative buffer overflow, and exploiting high-level access patterns to minimize speculation-induced traffic. To show that the resulting system is truly scalable, we perform simulations with up to 128 processors. With our optimizations, the speedups for 128 and 64 processors reach 63 and 48, respectively. The average speedup for 64 processors is 32, nearly four times higher than without our optimizations
BibTeX:
@inproceedings{Prvulovic2001,
  author = {Prvulovic, M. and Garzaran, M. J. and Rauchwerger, L. and Torrellas, J.},
  title = {Removing architectural bottlenecks to the scalability of speculative parallelization},
  booktitle = {Proc. 28th Annual International Symposium on Computer Architecture},
  year = {2001},
  pages = {204--215},
  doi = {10.1109/ISCA.2001.937450}
}
L. Ramos, P. Ibáñez, V. Viñals and J.M. Llabería (2000), "Modeling load address behaviour through recurrences", In Proc. ISPASS Performance Analysis of Systems and Software 2000 IEEE Int. Symp. , pp. 101-108.
Abstract: Addresses of load instructions exhibit regularity in their behaviour which is modelled through several models (locality, repetitive patterns, etc.) and exploited in processor and memory hierarchy design. Nevertheless, sparse and symbolic applications are intensive in addressing patterns not entirely covered by current models. In this work we introduce a new recurrence among load pairs called “linear link” in order to identify more regularity from such applications. A linear link is a type of recurrence between the value read by a (producer) load and the address issued by a (consumer) load, which is detected by tracking on-the-fly dependencies among loads. We consider a broad workload (Nas, Olden, Perfect, Spec95 and IAbench) and conclude that linear links together with stride recurrences can identify many address streams in symbolic and scientific applications traversing either dense, linked data structures or compressed forms of sparse arrays. The two recurrence combinations identify more than 90% of the addresses in more than half of the programs (in 24 out of 55), and more than 75% of the addresses in 90% of the programs (50 out of 55). Finally, we show several measures related to the use of linear links as address predictors for executing loads speculatively and for issuing data prefetches (prediction distance ahead capacity, etc.).
BibTeX:
@inproceedings{Ramos2000a,
  author = {Ramos, L. and Ibáñez, P. and Viñals, V. and Llabería, J. M.},
  title = {Modeling load address behaviour through recurrences},
  booktitle = {Proc. ISPASS Performance Analysis of Systems and Software 2000 IEEE Int. Symp},
  year = {2000},
  pages = {101--108},
  doi = {10.1109/ISPASS.2000.842288}
}
T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez and V. Viñals (1999), "Delaying physical register allocation through virtual-physical registers", In Proc. 32nd Annual International Symposium on MICRO-32 Microarchitecture. , pp. 186-192.
Abstract: Register file access time represents one of the critical delays of current microprocessors, and it is expected to become more critical as future processors increase the instruction window size and the issue width. This paper presents a novel physical register management scheme that allows for a late allocation (at the end of execution) of registers. We show that it can provide significant savings in number of registers and thus, it can significantly shorten the register file access time. The approach is based on virtual-physical registers, which we presented in a previous work, extended with a new register allocation policy. This policy consists of an on-demand allocation in order to maximize the register usage, combined with a stealing mechanism that prevents older instructions from being delayed by younger ones. This shortens the average number of cycles that each physical register is allocated, and allows for an early execution of instructions since they can obtain a physical register for their destination earlier than with the conventional scheme. Early execution is especially beneficial for branches and memory operations, since the former can be resolved earlier and the latter can prefetch their data in advance.
BibTeX:
@inproceedings{Monreal1999,
  author = {Monreal, T. and Gonzalez, A. and Valero, M. and Gonzalez, J. and Viñals, V.},
  title = {Delaying physical register allocation through virtual-physical registers},
  booktitle = {Proc. 32nd Annual International Symposium on MICRO-32 Microarchitecture},
  year = {1999},
  pages = {186--192},
  doi = {10.1109/MICRO.1999.809456}
}
P. Ibáñez, V. Viñals, J.L. Briz and M.J. Garzarán (1998), "Characterization and improvement of load/store cache-based prefetching", In ICS '98: Proceedings of the 12th international conference on Supercomputing. New York, NY, USA , pp. 369-376. ACM.
BibTeX:
@inproceedings{Ibanez1998,
  author = {Ibáñez, Pablo and Viñals, Víctor and Briz, José Luis and Garzarán, María Jesús},
  title = {Characterization and improvement of load/store cache-based prefetching},
  booktitle = {ICS '98: Proceedings of the 12th international conference on Supercomputing},
  publisher = {ACM},
  year = {1998},
  pages = {369--376},
  doi = {10.1145/277830.277921}
}
A. Gonzalez, M. Valero, J. Gonzalez and T. Monreal (1997), "Virtual registers", In Proc. Fourth International Conference on High-Performance Computing. , pp. 364-369.
Abstract: The number of physical registers is one of the critical issues of current superscalar out-of-order processors. Conventional architectures allocate, in the decoding stage, a new storage location (e.g. a physical register) for each operation that has a destination register. When an instruction is committed, it frees the physical register allocated to the previous instruction that had the same destination logical register. Thus, an additional register (i.e. in addition to the number of logical registers) is used for each instruction with a destination register from the time it is decoded until it is committed. In this paper, we propose a novel register organization that allocates physical registers when instructions complete their execution. In this way, the register pressure is significantly reduced, since the additional register is only used from the time execution completes until the instruction is committed. For some long-latency instructions (e.g. load with a cache miss) and for parts of the code with a small amount of parallelism, the savings could be very high. We have evaluated the new scheme for a superscalar processor and obtained a significant speedup
BibTeX:
@inproceedings{Gonzalez1997,
  author = {Gonzalez, A. and Valero, M. and Gonzalez, J. and Monreal, T.},
  title = {Virtual registers},
  booktitle = {Proc. Fourth International Conference on High-Performance Computing},
  year = {1997},
  pages = {364--369},
  doi = {10.1109/HIPC.1997.634516}
}
P. Ibáñez and V. Viñals (1996), "Performance assessment of contents management in multilevel on-chip caches", In Proc. 22nd EUROMICRO Conf. EUROMICRO 96. 'Beyond 2000: Hardware and Software Design Strategies'. , pp. 431-440.
Abstract: This paper deals with two level on-chip cache memories. We show the impact of three different relationships between the contents of these levels on the system performance. In addition to the classical Inclusion contents management, we propose two alternatives, namely Exclusion and Demand, developing for them the necessary coherence support and quantifying their relative performance in a design space (sizes, latencies, ...) in agreement with the constraints imposed by integration. Two performance metrics are considered: the second-level cache miss ratio and the system CPI. The experiments have been carried out running a set of integer and floating point SPEC'92 benchmarks. We conclude showing the superiority of our improved version of Exclusion throughout all the sizing and workload spectrum studied
BibTeX:
@inproceedings{Ibanez1996,
  author = {Ibáñez, P. and Viñals, V.},
  editor = {IEEE Computer Society Press. ISBN: 0-8186-7487-3},
  title = {Performance assessment of contents management in multilevel on-chip caches},
  booktitle = {Proc. 22nd EUROMICRO Conf. EUROMICRO 96. 'Beyond 2000: Hardware and Software Design Strategies'},
  year = {1996},
  pages = {431--440},
  doi = {10.1109/EURMIC.1996.546467}
}
L. Jimeno, P. Ibáñez and V. Viñals (1996), "Warm Time Sampling: Fast and Accurate Cycle-Level Simulation of Cache Memory", In 22nd Euromicro Conference. Short Contributions. , pp. 39-44.
Abstract: This paper proposes a new technique for reducing cache memory simulation time when measuring CPI. We perform time-sampling simulation but still use the parts of the trace that do not belong to the sample to update the state of the memory system, in order to avoid cold-start problems at the beginning of the next simulated interval. In our simulation environment, and using this warm-up technique, we achieve a reduction by a factor of  in the elapsed simulation time with an error less than  in the CPI estimation.
BibTeX:
@inproceedings{Jimeno1996,
  author = {L. Jimeno and P. Ibáñez and V. Viñals},
  editor = {IEEE Computer Society Press. ISBN: 0-8186-7703-1},
  title = {Warm Time Sampling: Fast and Accurate Cycle-Level Simulation of Cache Memory},
  booktitle = {22nd Euromicro Conference. Short Contributions},
  year = {1996},
  pages = {39--44}
}

Created by JabRef on 26/11/2014.